Custom-made archive format question

Custom-made archive format question - filesystems

I was thinking about developing my own file archive format to use for private projects. I am not looking for a solution like 7z or RAR; I want to make something different, similar to a file system.
Looking at real file systems, they all have two sections in common in their architecture: information about the files stored on disk, and the actual data of the files, as follows:
----------------------------
METADATA | FILE DATA
----------------------------
My question is: how do these two sections avoid overlapping? I mean, the METADATA section grows towards the FILE DATA section, while the latter grows towards the end of the disk (partition). How does a file system manage these sections?
This is what I have been trying to figure out for most of the time, and any tip would be more than welcome.

Most file systems operate with clusters, pages, or blocks, which have a fixed size. In many file systems the directory (metadata) is just a special file, so it can grow in the same way regular data files grow. On other file systems some master metadata block has a fixed size which is pre-allocated during file system formatting. In this case the file system can become full before the files take up all available space.
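To make the "directory is just a special file" idea concrete, here is a minimal sketch of a block pool in which metadata and file data draw blocks from the same free list, so neither section has to be reserved ahead of the other. Everything in it (the `Volume` struct, the `BLOCK_FREE` and `BLOCK_END` markers, the function names) is hypothetical and made up for illustration; it is not taken from any real file system.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_COUNT 8u
#define BLOCK_FREE  0xFFFFFFFFu /* hypothetical marker: block is unused */
#define BLOCK_END   0xFFFFFFFEu /* hypothetical marker: last block of a chain */

/* next[i] is the block that follows block i in its chain, or a marker. */
struct Volume {
    uint32_t next[BLOCK_COUNT];
};

void volume_init(struct Volume *v)
{
    for (uint32_t i = 0; i < BLOCK_COUNT; i++)
        v->next[i] = BLOCK_FREE;
}

/* Claim the first free block as a one-block chain; BLOCK_END means "full". */
uint32_t volume_alloc(struct Volume *v)
{
    for (uint32_t i = 0; i < BLOCK_COUNT; i++) {
        if (v->next[i] == BLOCK_FREE) {
            v->next[i] = BLOCK_END;
            return i;
        }
    }
    return BLOCK_END;
}

/* Append one block to the chain starting at head; returns the new block. */
uint32_t volume_grow(struct Volume *v, uint32_t head)
{
    assert(head < BLOCK_COUNT && v->next[head] != BLOCK_FREE);
    uint32_t tail = head;
    while (v->next[tail] != BLOCK_END)
        tail = v->next[tail];
    uint32_t fresh = volume_alloc(v);
    if (fresh != BLOCK_END)
        v->next[tail] = fresh;
    return fresh;
}
```

If the directory starts in block 0 and a file in block 1, growing the file and then the directory gives the directory chain 0 -> 3 and the file chain 1 -> 2: the two kinds of data simply interleave, and nothing has to "grow towards" anything else.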
On a side note, is there a reason to reinvent the wheel (a custom file system for private needs)? There are implementations of in-file virtual file systems which are similar to archives but provide more functionality. One example is our SolFS.

All you need is a manifest containing the file list, archive name, and/or password, followed by all the files listed there.
If you can make the files smaller, then that's even better!

Related

C - Storing a large group of files as a single resource

Please forgive me if there is a glaringly obvious answer to this question; I haven't found it because I'm not entirely sure what I'm looking for. It may well be that this duplicates a question I haven't found; sorry.
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly (not that I anticipate it being distributed; I'm looking to package my own work for convenience).
In my opinion it would be more convenient if the file library was stored in a single file that remained accessible to the application, for example alongside /usr/bin/APPLICATION or in the most appropriate location, and was accessed by the executable when required.
I searched for similar questions and found suggestions that indicated two possible options: resource files, which appear to be native to Windows, and including files at compile time. The first question leads to an answer similar to the second and doesn't answer the question about the existence of resource files for Linux executables. It (like the second) looks at including the data file in the compilation process. This is not so useful, because if I only want to update my resources I'm forced to recompile the entire application (the media is dynamically added).
QUESTION: Is there a way to store a variety of file types in one single file accessible to an executable in Linux, and if so how would you implement this?
My initial thought was to create a .zip or .gz file, which might also offer compression as an added bonus, but I have no idea how (or if it is even possible) to access data within such a file on the fly. I'm equally uncertain whether there is a specific file type or library that offers a more suitable solution. Also, I know virtually nothing about .dat files; could these be used in this context on a Linux system?

I do not understand why you would use a single file at all. Considering the added complexity (and increased chance of bugs creeping in) of file extraction and the associated overheads, I do not see how it would be "more convenient".
I have a C executable that uses text, audio, video, icons and a variety of different file types.
So do many other Linux applications. The normal approach, when using package management, is to put the architecture-independent data (icons, audio, video, and so on) for application /usr/bin/YOURAPP in /usr/share/YOURAPP/, and architecture-dependent data (like helper binaries) in /usr/lib/YOURAPP/. It is extremely common for the latter two to be full directory trees, sometimes quite deep and wide.
For locally compiled stuff, it is common to put these in /usr/local/bin/YOURAPP, /usr/local/share/YOURAPP/, and /usr/local/lib/YOURAPP/ instead, just to avoid confusing the package manager. (If you check ./configure scripts or read Makefiles, this is the chief purpose of the PREFIX variable they support.)
It is also common for the /usr/bin/YOURAPP to be a simple shell script, setting environment variables, or checking for user-specific overrides (from $HOME/.YOURAPP/), ending up with exec /usr/lib/YOURAPP/YOURAPP.bin [parameters...], which replaces the shell with the actual binary executable without leaving the shell in memory.
As an example, /usr/share/octave/ on my machine contains a total of 138 directories (in a hierarchy of up to 7 directories deep) and 1463 files; about ten megabytes of "stuff" all told. LibreOffice, Eagle, Fritzing, and KiCAD take hundreds of megabytes there each, so Octave is not an extreme example in any way either.

You have several alternatives (TODO: add more ;)):
You can read some archive file format specifications, write code to read/write those archives, and waste your time doing so.
You can invent a dirty, simple file format, for example ("dsa" stands for "Dirty and Simple Archiver"):
#include <stdint.h>

// Located at the beginning of the file
struct DSAHeader {
    char magic[3]; // Shall be (char[]) { 'D', 'S', 'A' }
    unsigned char endianness; // The rest of the file is translated according to this field. 0 means little-endian, 1 means big-endian.
    unsigned char checksum[16]; // MD5 sum of the whole file (when calculating the checksum, this field is treated as if filled with zeros).
    uint32_t fileCount;
    uint32_t stringTableOffset; // Offset of the table containing the files' names.
};

// A dsaHeader.fileCount-sized array of DSANodeHeader follows the DSAHeader.
struct DSANodeHeader {
    unsigned char type; // 0 means directory, 1 means regular file.
    uint32_t parentOffset; // Pointer to the parent directory, or zero if the node is in the root.
    uint32_t offset; // The node's type-dependent header starts here.
    uint32_t nodeSize; // In bytes for files, and in number of entries for directories.
    uint32_t dataOffset; // The file's data starts at this offset for files, and a pointer to the first DSADirectoryEntryHeader for directories.
    uint32_t filenameOffset; // Relative to the string table.
};

typedef uint32_t DSADirectoryEntryHeader; // Offset to the entry's DSANodeHeader
The "string table" is a contiguous sequence of null-terminated character strings.
This format is very simple (and portable ;)). And, as a bonus, if you want (de)compression, you can use something like gzip, bzip2, or xz to (de)compress your file (those programs/formats are archiver-agnostic, i.e., not dependent on tar, as commonly believed).
As a last (or first?) resort, you may use an existing library/API for manipulating archives and compressed file formats.
Edit: Added support for directories :).
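As a rough illustration of how a reader might use the string table, here is a sketch that looks up a node by name once the node array and the string table have been loaded into memory. The function name `dsa_find_node` is hypothetical, the struct is repeated only so the block compiles on its own, and loading the file (including endianness conversion and struct padding) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as the DSANodeHeader above, repeated so this compiles alone. */
struct DSANodeHeader {
    unsigned char type;
    uint32_t parentOffset;
    uint32_t offset;
    uint32_t nodeSize;
    uint32_t dataOffset;
    uint32_t filenameOffset;
};

/*
 * Return the index of the first node whose name equals `name`, or -1.
 * Entries whose name offset is out of range or unterminated are skipped,
 * so a damaged archive cannot make us read past the string table.
 */
long dsa_find_node(const struct DSANodeHeader *nodes, uint32_t count,
                   const char *strtab, size_t strtab_size, const char *name)
{
    assert(name != NULL && (nodes != NULL || count == 0));
    for (uint32_t i = 0; i < count; i++) {
        uint32_t off = nodes[i].filenameOffset;
        if (off >= strtab_size)
            continue;
        if (memchr(strtab + off, '\0', strtab_size - off) == NULL)
            continue;
        if (strcmp(strtab + off, name) == 0)
            return (long)i;
    }
    return -1;
}
```

A linear scan is fine for a handful of files; for a large archive you would probably sort the names or build a hash table when the archive is opened.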

I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly.
Consider the added complexity of handling different file types along with a large, deep folder structure that must be installed with the application. With a single resource file it would be difficult, or I would say nearly impossible, to track changes if you want to change resources dynamically. Adding the resources to the executable file is certainly not an option, as it would increase the size of the executable and require frequent recompilation whenever the resources are updated.
After considering all aspects of your project, it seems to me the solution would be to use an INI file. The INI file would be stored at a definite location, and the locations of the other resources would be provided in it. With an INI file you can easily store the locations, hash keys, and sizes of the resources, which makes it easy to check for changes or to update the resources.
Since you are using file types that are already compressed, general-purpose zip algorithms would not help much, as the compression ratio would be very low. I therefore recommend using 7z-style algorithms for compression. Of the various algorithms, I would suggest xz, as it is currently used by many open-source projects to compress binaries and decrease their size.
For each compressed file, its CRC32 or hash value should also be included in the INI file to check the validity of the data transferred.
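For the checksum part, a minimal sketch of a bitwise CRC-32 is shown here, using the reflected polynomial 0xEDB88320 (the variant used by zlib and PNG). The function name `crc32_ieee` is made up for this example; in a real project you would more likely call an existing library, and a table-driven version would be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), one bit at a time. */
uint32_t crc32_ieee(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; apply the polynomial if the dropped bit was 1. */
            uint32_t poly = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ poly;
        }
    }
    return ~crc;
}
```

The value stored in the INI file can then be compared with the value recomputed over the resource after it has been read or decompressed.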

Let's say you have:
top-level-folder/
|
- your-linux-executable
- icon-files-folder/
- image-files-folder/
- other-folders/
- other-files
Do this (from the directory that contains top-level-folder):
tar zcvf my-package.tgz top-level-folder
To expand, do this:
tar zxvf my-package.tgz

FAT System Identification of free space and structure of entry files?

I have been searching Google for a good explanation of how FAT systems identify free space and of the structure of FAT entry files.
A lot of the explanations I've found are quite hard to follow. Can anyone help briefly sum these up?
I understand that clusters are marked as unused, but is this within the root directory or the data region? And is the information on cluster status just marked in a table?
I haven't managed to gain any knowledge on the structure of the entry files either, just that they use chains to keep the clusters together.
Can anyone help?

A file system can be thought of as having three (3) types of data: file data, file meta-data and file system meta-data. File data is file or directory contents. File meta-data is that which tells us where the file data is stored on the disk. File system meta-data tells us how the file system allocates the blocks used in the file system.
The FAT file system however does not keep the lines so clear cut. Its disk structures often blur these distinctions.
The File Allocation Table (FAT) itself blurs the lines between file meta-data and file system meta-data. That is, each FAT entry identifies the cluster number where the next cluster of file (or directory) data can be found, and also indicates to the file system whether the cluster identified by the index into the FAT is available (or not). As you indicated in your question, this forms a chain. A special marker (the specific value escapes my memory) indicates that the cluster identified by the index into the FAT is the last cluster in the chain.
Directory entries in a FAT based file system are both file data and file meta-data. They read like files with their entries being the "file data". However, their entries are also interpreted as file meta-data, for they contain the file attributes (permissions, file size, and the starting cluster number--which is an index into the FAT).
The root directory is a special directory on a FAT file system. If memory serves, it has neither a "." nor a ".." entry. On FAT12 and FAT16 systems, the size of the root directory is specified when the disk is formatted and is thus fixed; it lives in its own region between the FATs and the data region, so it does not occupy clusters tracked in the FAT. On FAT32, the root directory size is not set at format time and can grow like any other cluster chain. The starting cluster of the root directory is stored in a special field in one of the file system meta-data structures (the BIOS Parameter Block in the boot sector).
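To tie the above together, here is a small sketch for FAT16 that classifies a table entry, follows a cluster chain, and counts free clusters. The value ranges are the usual FAT16 ones (0x0000 free, 0xFFF7 bad cluster, 0xFFF8 to 0xFFFF end of chain); the function and enum names are invented for this example, and the table is assumed to be already loaded into memory as an array of 16-bit entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum FatEntryKind { FAT_FREE, FAT_NEXT, FAT_BAD, FAT_END, FAT_RESERVED };

/* Interpret one FAT16 table entry. */
enum FatEntryKind fat16_classify(uint16_t entry)
{
    if (entry == 0x0000) return FAT_FREE;     /* cluster is unused */
    if (entry == 0x0001) return FAT_RESERVED;
    if (entry >= 0xFFF8) return FAT_END;      /* last cluster of a chain */
    if (entry == 0xFFF7) return FAT_BAD;      /* marked as a bad cluster */
    if (entry >= 0xFFF0) return FAT_RESERVED;
    return FAT_NEXT;                          /* number of the next cluster */
}

/* Number of clusters in the chain starting at `start`; 0 if it is broken. */
size_t fat16_chain_length(const uint16_t *fat, size_t entries, uint16_t start)
{
    size_t length = 0;
    uint16_t cluster = start;

    assert(fat != NULL);
    while (length < entries) {          /* bound the walk to catch loops */
        if (cluster < 2 || cluster >= entries)
            return 0;                   /* entries 0 and 1 are not data clusters */
        uint16_t entry = fat[cluster];
        enum FatEntryKind kind = fat16_classify(entry);
        length++;
        if (kind == FAT_END)
            return length;
        if (kind != FAT_NEXT)
            return 0;
        cluster = entry;
    }
    return 0;
}

/* Free space is found by scanning the table for zero entries. */
size_t fat16_count_free(const uint16_t *fat, size_t entries)
{
    size_t free_clusters = 0;
    for (size_t i = 2; i < entries; i++)
        if (fat16_classify(fat[i]) == FAT_FREE)
            free_clusters++;
    return free_clusters;
}
```

So, to the original question: the used/free status lives in the table itself, not in the root directory or the data region; a directory entry only supplies the first cluster number of a chain.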
Hope this helps.

Here is a fairly long article that has lots of information about fat file systems.
It should provide all the details you need.
http://en.wikipedia.org/wiki/File_Allocation_Table

Changing inode behaviour

I am trying to modify the ext3 file system. Basically I want to ensure that the inode for a file is saved in the same (or adjacent) block as the file that it stores metadata for. Hopefully this should help disk access performance.
I grabbed the kernel source, compiled it, read a bunch about inodes and looked at the inode.c file in the fs subdirectory. However, I am just not sure how I can ensure that any new file being created, and the inode for this file, can be saved in the same or adjacent blocks. Any help or pointers to further reading would be appreciated. Thanks!

Interesting idea.
I'm not deeply familiar with ext3, but I can give you some general pointers.
Currently ext3 stores inodes in predetermined places. Each block group has its own inode table, an array of inodes. So when you have an inode number (i.e., as the result of looking up a filename in a directory), you can find the corresponding inode on disk by using the inode number first to select the correct block group and then to index into that block group's inode table.
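The lookup described above is just a division and a remainder. A minimal sketch follows, assuming the ext2/ext3 convention that inode numbers start at 1; the struct and function names are invented for this example, the two parameters correspond to the superblock fields s_inodes_per_group and s_inode_size, and finding where the group's inode table starts on disk (bg_inode_table in the group descriptor) is not modelled.

```c
#include <assert.h>
#include <stdint.h>

struct InodeLocation {
    uint32_t group;        /* block group that holds the inode */
    uint32_t index;        /* slot within that group's inode table */
    uint64_t table_offset; /* byte offset from the start of that table */
};

/* Map an inode number to its block group and position in the inode table. */
struct InodeLocation locate_inode(uint32_t ino, uint32_t inodes_per_group,
                                  uint32_t inode_size)
{
    struct InodeLocation loc;

    assert(ino >= 1 && inodes_per_group > 0);
    uint32_t zero_based = ino - 1;  /* inode numbers are 1-based */
    loc.group = zero_based / inodes_per_group;
    loc.index = zero_based % inodes_per_group;
    loc.table_offset = (uint64_t)loc.index * inode_size;
    return loc;
}
```

This is why the location is "predetermined": the inode number alone fixes where the inode lives, which is exactly the property a scheme that stores inodes next to file data would have to replace.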
If you want to put the inodes next to the corresponding file data, you'll need a new scheme for finding an inode on disk. If you're willing to dedicate a block for each inode, then one possible scheme would be to allocate a new block every time you need an inode and then use the block number as the inode number. This might have the benefit that for small files you could store the data in that same block.
To make something like this happen, creating a new file (i.e., allocating an inode) would have to work very differently than in the current ext3 file system. Instead of using a bitmap to find an unused, pre-allocated and pre-initialized inode, you would have to allocate an empty block and initialize it yourself. So, you'll probably want to look at how the file system allocates blocks when it's writing to a file, then mimic that for allocating an inode.
An alternative scheme would be to store the inode inside the directory. So you save an I/O not because the inode is next to its data, but because when you look up the filename you also read the inode. This was done back in the 90s as an experiment in BSD's FFS file system, and was written up in an excellent USENIX paper. Those ideas never made it into FFS, or into any other mainstream file system that I'm aware of, so it might be interesting to see how they work in ext3.
Regardless of whether you pursue one of these schemes or come up with something of your own, you'll also have to modify mke2fs to initialize the file system on disk in a way that your new file system variant will understand.
Good luck! It sounds like a fun project.

Kudos for getting into file system design!
First, a bit of engineering advice before you get too deep into hacking: make a copy of the ext3 tree and rename the file system to something else. I've found that when introducing experimental changes into a file system, you really don't want it to be used for your main system. Your system should still boot even if you introduce a bug that randomly loses files (it will eventually happen). You'll also need to branch the ext3 userspace tools to work with your new system.
Second, go get a copy of Understanding the Linux Kernel, 3 ed. by Bovet and Cesati. It presents an organized view of kernel subsystems, and I've found its explanations to be worthwhile. It's written for an older kernel (2.6.x for some x < 15; I forget exactly), but it's still accurate in many places. Read through its descriptions of file systems. I believe it covers ext3.
Third, about your actual project, you aren't proposing a simple modification to ext3. That file system has a pretty straightforward way of mapping an inode number to a disk block. You'll need to find a new way of doing this mapping. I would not anticipate any changes to the rest of ext3. Solving this challenge may be one of the key design points of your architecture. Note that keeping around a big array of inode -> disk block maps doesn't solve your problem: it's probably no better than existing ext3.

how is a file represented on a disk

I want to ask, and forgive me if this is an obvious or newbie question:
If I create a file, say a text file, and save it (I'm using Ubuntu), the file has some extra information associated with it, such as the place on my hard drive where it has been saved. How can I examine this information? Where does this information get stored for my specific file? How can I examine the file as it is stored on my disk, I assume in terms of bytes?
Maybe I need to focus this question,
Thanks,
B

This is the responsibility of your file system. Very briefly, a file system is a data structure which is laid out onto your entire disk -- that's what "formatting" a disk does -- and your files are saved into that data structure. There are lots of file systems, and their details vary quite widely. http://www.forensics.nl/filesystems has a whole bunch of papers on file system design and organization. I'd start with McKusick's A Fast File System for UNIX; it's old, but it contains lots of ideas that are still influential today.
You need a filesystem-specific forensics tool if you want to look at the data structures on your disks. Ubuntu's probably using something in the ext2 family, so try debugfs.

I think maybe you do need to focus it a bit :-)
For UNIX file systems, there are many different types.
The one I'm most familiar with (ext2) has a "file" on disk containing directory entries. These entries are simple names and pointers to the file itself (which is why you can have multiple directory entries pointing to the same file, hard links).
The file itself is an inode which contains the properties of the file (owner, size, permissions and so on).
The inode also contains direct and indirect pointers to the contents of the file. By direct, I mean a pointer to a data block.
An indirect pointer is a pointer to a pointer to contents. I believe you can go to another two levels of indirection, which gives you truly massive file sizes.
More details on Wikipedia.
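To get a feel for "truly massive", here is a back-of-the-envelope sketch of how many bytes the pointer scheme alone can address, assuming the classic ext2 layout of 12 direct pointers plus one single, one double and one triple indirect pointer, with 4-byte block numbers. The function name is hypothetical, and other fields in a real inode may impose a lower practical limit than this figure.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable through 12 direct + single/double/triple indirect pointers. */
uint64_t ext2_max_mapped_bytes(uint32_t block_size)
{
    assert(block_size >= 1024 && block_size % 4 == 0);
    uint64_t ptrs = block_size / 4;      /* block numbers per indirect block */
    uint64_t blocks = 12                 /* direct */
                    + ptrs               /* single indirect */
                    + ptrs * ptrs        /* double indirect */
                    + ptrs * ptrs * ptrs /* triple indirect */;
    return blocks * block_size;
}
```

With 1 KiB blocks this comes to roughly 16 GiB, and with 4 KiB blocks roughly 4 TiB, almost all of it contributed by the triple indirect level.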

Fastest file access/storage?

I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!

This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS; it's designed for high-volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, definitely choose NTFS. Don't store too many files in a single directory; 100,000 might be an upper limit to consider (although you will have to experiment, as there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much; you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that the performance of filesystems typically decreases as the number of files in a directory increases. This depends highly on the filesystem in use; for example, FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much either, so that you minimise the number of directory lookups the filesystem has to do.

That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.

This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large numbers of files vary widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep -- three to four levels ought to be enough to give the performance benefits -- most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns.)
Also, I would store the file itself under its entire name; this will make it easier to traverse the directory structure manually as well, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt
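Computing such a path from the file name is only a few lines. Here is a rough sketch; `shard_path` and its parameters are made up for this example, and it does nothing about characters that are awkward in directory names (a segment such as "r." would need special handling on Windows).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build "<seg>/<seg>/.../<name>" from the leading characters of `name`.
 * seg_len: characters per directory level; depth: maximum number of levels.
 * Levels stop early if the name is too short for another full segment.
 * Returns 0 on success, -1 if `out` is too small.
 */
int shard_path(const char *name, size_t seg_len, size_t depth, char sep,
               char *out, size_t out_size)
{
    size_t name_len = strlen(name);
    size_t pos = 0;

    assert(seg_len > 0 && out != NULL);
    for (size_t level = 0; level < depth; level++) {
        size_t start = level * seg_len;
        if (start + seg_len > name_len)
            break;
        if (pos + seg_len + 1 >= out_size)
            return -1;
        memcpy(out + pos, name + start, seg_len);
        pos += seg_len;
        out[pos++] = sep;
    }
    if (pos + name_len + 1 > out_size)
        return -1;
    memcpy(out + pos, name, name_len + 1); /* copies the terminator too */
    return 0;
}
```

With seg_len 1 and depth 4, "foobar.txt" becomes f/o/o/b/foobar.txt; with seg_len 2, depth 3 and a backslash separator it becomes fo\ob\ar\foobar.txt, so the same function covers the layouts discussed in the answers here.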

This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use solid state drives.
If you can reason an access pattern, you might be able to leverage locality of reference to place these files.
Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possible for you to consolidate some files? 1 KB files are not very efficient to store, because file systems (FAT32, NTFS) have a cluster size and each file will use at least one cluster even if it is smaller than the cluster size. There is usually a limit on the number of files in each folder, along with performance concerns. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
If you are set on using the trie structure, I would suggest surveying the distribution of file names and then breaking them into different folders based on that distribution.

First of all, the file size is very small. Any file system will eat something like at least 4 times more space: with a 4 KB allocation unit, any 1 KB file will occupy 4 KB on disk. Especially on SSDs, a 4 KB unit will be the norm.
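The arithmetic behind that estimate is just rounding each file up to a whole number of clusters. A tiny sketch, with a made-up function name, treating the cluster size as a parameter (NTFS may store very small files inside the MFT record itself, so for NTFS this is an upper estimate):

```c
#include <assert.h>
#include <stdint.h>

/* Space a file occupies when rounded up to whole clusters. */
uint64_t allocated_bytes(uint64_t file_size, uint32_t cluster_size)
{
    assert(cluster_size > 0);
    uint64_t clusters = (file_size + cluster_size - 1) / cluster_size;
    return clusters * cluster_size;
}
```

For the numbers in the question, 750,000,000 files of 1 KB each hold about 768 GB of data, but at 4 KB per file they would occupy about 3 TB.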
So you have to group several files into one physical file. 1024 files in one storage file seems reasonable. To locate the individual files in these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
The directory structure suggested by your friend sounds good, but it does not solve the physical storage problem. You may use a similar directory structure to store the storage files. It is better to name them using a numerical scheme.
If you can, do not let them format the drive as FAT32; use at least NTFS or some recent Unix-flavored file system. As the total size of the files is not that big, NTFS may be sufficient, but ZFS is the better option...

Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.

Why isn't storing the paths in a database table acceptable?

My guess is he is thinking of a Trie data structure to create on disk where the node is a directory.

I'd check out Hadoop's model.
P

I know this is a few years late, but maybe this can help the next guy..
My suggestion: use a SAN, mapped to a Z: drive that other servers can map to as well. I wouldn't go with the folder path your friend suggested, but rather with drive:\clientid\year\month\day\, and if you ingest more than 100k docs a day, then you can add subfolders for hour and even minute if needed. This way, you never have more than 60 subfolders at any level, even when going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This keeps the folder path pretty short, for example Z:\05\2004\02\26\09\55\filename.txt, so you don't run into any path length limitations (around 256 characters) across the board.
Hope that helps someone. :)
