I would like to embed some metadata in a Windows file.
I came across the concept of extended file attributes, which I believe are used for this very purpose, for example the camera name in JPEGs or the episode name in AVIs.
Apart from some very obscure, undocumented kernel APIs, I cannot find how to do this in C/C++ using the Win32 API.
Extended Attributes are a property of the filesystem, i.e. NTFS. The tags associated with JPEGs and AVIs are stored within the file itself. The Win32 APIs will only provide you with the EAs from the filesystem, not the ones embedded within the files. You'll have to look into third-party libraries to retrieve the embedded attributes.
In the general case, metadata can be formatted in any way that is easy for your application to access. The RDF specification was created to provide a standard set of metadata capabilities that cover most of the generally useful kinds of information.
However, the problem is always finding a way to store it alongside the real data in a way that doesn't disturb applications that think they know how to handle the format. This can be particularly tricky for well-known formats.
Adobe has done a lot of research on this problem, and is backing a technology they call XMP to achieve a good result. XMP includes metadata in a style closely related to RDF, along with conventions for packing it inside many other file formats, or in side-car files for those cases where there just is no portable way to fit the data inside.
On a Windows system with all files stored on NTFS volumes, it is conceivable that extended attributes and alternate data streams could be used to store metadata. The big issue with this is one of portability. The alternate streams will be lost if the file is copied to media that does not support them, such as any flavor of FAT as well as the file systems used on CDs and DVDs.
This is a serious defect that makes keeping a valid and complete backup of such a file more difficult than is practical for most users.
There are applications that use alternate data streams, but they do so knowing that the value they add can be lost when the file is copied.
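For completeness, the Win32 side of this is quite small: an alternate data stream is opened with an ordinary CreateFile call by appending ":streamname" to the path. Below is a minimal sketch; the file name and the stream name "author" are just placeholders, and the caveat above still applies (the stream is silently dropped when the file is copied to FAT, CDs, or DVDs).

// Minimal sketch: store a small metadata string in an NTFS alternate
// data stream. "example.txt" and the stream name "author" are placeholders.
#include <windows.h>
#include <iostream>

int main()
{
    // Open (or create) the alternate stream "author" of example.txt.
    HANDLE h = CreateFileW(L"example.txt:author",
                           GENERIC_WRITE, 0, nullptr,
                           OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        std::cerr << "CreateFileW failed: " << GetLastError() << '\n';
        return 1;
    }

    const char metadata[] = "Camera: ExampleCam 3000";
    DWORD written = 0;
    WriteFile(h, metadata, sizeof(metadata) - 1, &written, nullptr);
    CloseHandle(h);

    // Reading the value back is the same call with GENERIC_READ / OPEN_EXISTING.
    return 0;
}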
We all know that most operating systems use file systems to store all data, but wouldn't it be more efficient to use databases, as we do in websites/web apps?
tl;dr: Diversity.
First of all, if you look at the original FAT filesystem and the original Unix filesystem, they were both key-value stores; they did not have a directory hierarchy.
Second, this link suggests there are filesystems implemented with an RDBMS backend, which is tangential to your question.
Having said that, comparing an RDBMS to a filesystem as storage for an OS, there are several drawbacks to using an RDBMS:
First, an RDBMS makes very strong guarantees (ACID) by means of locking, at the cost of performance. However, most programs do not require such guarantees (for example, think of every program that works with a NoSQL DB). In comparison, POSIX makes strong-ish guarantees about metadata, but barely any guarantees about I/O. You can build an RDBMS on top of POSIX and add locking, but you can't build a filesystem on top of an RDBMS and remove locking.
Second, an RDBMS requires a schema. Imagine that you create a new storage volume for an OS. Instead of formatting a filesystem, you need to decide on a schema. What schema will be the most useful?
With filesystems, the "schema" is basically one table with the columns "path" and "data", plus a column for each file attribute such as modification time, type, and size. Using an RDBMS for this schema would let you perform operations like mass truncate, mass rename, and mass access control atomically. However, it would not let you modify the data of the same record (file) concurrently, nor would it let you implement hard links. Extended attributes or Alternate Data Streams would still have to be implemented much as they are today rather than leveraging RDBMS capabilities. You would also need special index logic for the path column in order to implement features like changing directory, listing a directory, and checking permissions for every directory in the path of a file, plus special logic for the data column because files can be TBs in size. At that point the ROI of an RDBMS goes down the more features you add.
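Just to make the "one big table" idea concrete, here is a sketch of what such a schema might look like, using SQLite purely for illustration (the column set is the hypothetical one described above, not anything a real OS does):

// Hypothetical "filesystem as one table" schema, sketched with SQLite's C API.
#include <sqlite3.h>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("fs.db", &db) != SQLITE_OK) return 1;

    const char* ddl =
        "CREATE TABLE IF NOT EXISTS files ("
        "  path  TEXT PRIMARY KEY,"   // would need the special index logic above
        "  data  BLOB,"               // problematic once files reach TBs in size
        "  mtime INTEGER,"
        "  type  TEXT,"
        "  size  INTEGER"
        ");";
    sqlite3_exec(db, ddl, nullptr, nullptr, nullptr);

    // A "mass rename" becomes a single atomic statement:
    sqlite3_exec(db,
        "UPDATE files SET path = replace(path, '/old/', '/new/') "
        "WHERE path LIKE '/old/%';",
        nullptr, nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}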
Alternatively you can have the schema be per-program (i.e. every program can do CREATE TABLE etc.), but then your features are again limited by what the RDBMS can do. For example, how do you get the equivalent of find / -size +1GB or md5sum, or even cat or ls? Which columns will these programs read? You'll find that all generic programs now need to take a set of columns that are of interest. It also makes scripting much harder.
Third, hierarchical systems are typically easier to scale.
One example is when you want to add storage. In a hierarchical filesystem, even without any fancy filesystem features, you can simply mount another filesystem onto a directory, and you have new storage. The tradeoff, compared with increasing the capacity of the current filesystem, is that hard links and renames don't work across filesystems, and the filesystems don't share storage capacity. With an RDBMS, however, your options are either to create a new table and have your programs/scripts manage both tables, or to add more storage volume, for which you might need to do more advanced things like partitioning.
Another example is ecosystem requirements. As an end user wanting to put some order into their 60,000 pictures, 5000 songs, hundreds of work spreadsheets, 10,000 memes, hundreds of eBooks, videos etc. - things that are convenient to arrange in a hierarchy - you currently only need two programs - a file manager (Explorer, bash, Nautilus etc.), and a search capability (e.g. find(1)). On an RDBMS, you either have different tables with different columns, or one table with generic columns. Either way, you have to have a set of SQL scripts to work with these specific collections, which would be equivalent to having a shell script or a program for each type of collection. Meaning, managing large collections requires more programming.
Since hierarchical systems are useful in a generic context (which is the context the major OSes operate in), and since it's easier to build a non-hierarchical system on top of a hierarchical one than the other way around (the hierarchical filesystem cache even makes the job easier for libsqlfs), it is valuable for OSes to support hierarchical systems as first-class citizens.
The executive summary is: OSes serve many use cases, and storage access is a major part of that. It would be wise for an OS to build a storage access mechanism that's as minimal as possible, but that allows applications to build more specialized storage access mechanisms on top of it.
That means providing a small but useful set of features (like permissions, locking, mounting, and symlinks) but not forcing too many requirements (like mandatory locking, or specifying the data format to the OS).
RDBMSes are just too specific.
I am writing an application which parses a large file, generates a large amount of data, and does some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for it.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify, I am writing a desktop application. The application takes as input a file hundreds of MB in size. After reading the file, the application will generate a large number of graphs, which will be visualized. Since the graphs may have a very large number of nodes, they may not fit into memory. Should I save them into an embedded database which will take care of keeping only the relevant data in memory (do embedded databases do that?), or should I write my own sophisticated module which does that?
Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (Windows, Linux, Android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management - maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming process approach so you only have in memory a small portion of the overall data at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
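If you do go the embedded-database route, the pattern is roughly the one sketched below with SQLite's C API: the parsing phase writes everything to a table on disk, and each view then pulls back only the rows it needs, so the database's own page cache (rather than code you write) decides what stays in memory. The table layout and the query are just assumptions for a node-based graph.

// Sketch: SQLite as an on-disk container for generated graph nodes,
// fetching only the subset needed for one view. Table layout is hypothetical.
#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("graph.db", &db);
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS nodes ("
        "  id INTEGER PRIMARY KEY, graph_id INTEGER, x REAL, y REAL);"
        "CREATE INDEX IF NOT EXISTS idx_graph ON nodes(graph_id);",
        nullptr, nullptr, nullptr);

    // ... the parsing phase would insert its millions of rows here ...

    // Visualization phase: load only the nodes of one graph at a time.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT id, x, y FROM nodes WHERE graph_id = ?;", -1, &stmt, nullptr);
    sqlite3_bind_int(stmt, 1, 42);               // hypothetical graph id
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        double x = sqlite3_column_double(stmt, 1);
        double y = sqlite3_column_double(stmt, 2);
        std::printf("node at (%f, %f)\n", x, y); // hand off to the renderer
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}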
Just a stream-of-consciousness response - hope that helps a little.
Edit:
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some sort of simplifying assumptions for rendering your graphs or investigate methods like segmentation (render sections of the graph and then cache the output before rendering the next section).
I've been thinking on this for a while now (you know, that dangerous thing programmers tend to do) and I've been wondering, is the method of storing data that we're so accustomed to really all that efficient? The trouble with answering this question is that I really don't have anything to compare it to, since it's the only thing I've ever used.
I don't mean FAT or NTFS or a particular type of file system, I mean the filesystem structure as a whole. We are simply used to thinking of "files" inside "folders" as if our hard drive were one giant filing cabinet. This is a great analogy and indeed, it makes it a lot easier to learn when we think of it this way, but is it really the best way to go about describing programs and their respective parts?
I'd like to know if anyone can think of (or knows about) a data storage technique that might be used to store data for an Operating System to use that would organize the parts of data in a different manner. Does anything... different even exist?
Emails are often stored in folders. But ever since I have migrated to Gmail, I have become accustomed to classifying my emails with tags.
I often wondered if we could manage a whole file-system that way: instead of storing files in folders, you could tag files with the tags you like. A file identifier would not look like this:
/home/john/personal/contacts.txt
but more like this:
contacts[john,personal]
Well... just food for thought (maybe this already exists!)
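To push the thought a little further (purely illustrative, not an existing system): assigning tags to files is a many-to-many relation, so resolving an identifier like contacts[john,personal] is just an intersection of the file sets behind each tag.

// Toy illustration of tag-based file lookup: a file can carry any number of
// tags, and "contacts[john,personal]" becomes a set intersection.
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

int main()
{
    // tag -> set of file identifiers (kept in memory here just for illustration)
    std::unordered_map<std::string, std::set<std::string>> by_tag{
        {"john",     {"contacts.txt", "resume.doc"}},
        {"personal", {"contacts.txt", "diary.txt"}},
    };

    // Files tagged both "john" and "personal"
    std::set<std::string> result = by_tag["john"];
    for (auto it = result.begin(); it != result.end();) {
        if (!by_tag["personal"].count(*it)) it = result.erase(it);
        else ++it;
    }

    for (const auto& f : result) std::cout << f << '\n'; // prints contacts.txt
    return 0;
}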
You can, for example, have dedicated solutions like Oracle Raw Partitions. Other databases support similar things. In these cases the filesystem provides unnecessary overhead and can be omitted - the DB software will take care of organising the structure.
The problem seems very application dependent and files/folders seem to be a reasonable compromise for many applications (and is easy for human beings to comprehend).
Mainframes used to just give programmers a number of 'devices' to use. The device corresponded to a drive or a partition thereof, and the programmer was responsible for the organisation of all data on it. Of course they quickly built up libraries to help with that.
The only OS I can think of that doesn't use the common hierarchical arrangement of flat files (as UNIX does) is PICK. It used a sort of relational database as the filesystem.
Microsoft had originally planned to introduce a new file system for Windows Vista (WinFS - Windows Future Storage). The idea was to store everything in a relational database (SQL Server). As far as I know, this project was never (or not yet?) finished.
There's more information about it on wikipedia.
I knew a guy who wrote his doctorate about a hard disk that comes with its own file system. It was based on an extension of SCSI commands that allowed the usual open, read, write and close commands to be sent to the disk directly, bypassing the file system drivers of the OS. I think the conclusion was that it is inflexible, and does not add much efficiency.
Anyway, this disk-based file system still had a folder-like structure, I believe, so I don't think it really counts for you ;-)
Well, there's always Pick, where the OS and file system were an integrated database.
Traditional file systems are optimized for fast file access if you know the name of the file you want (including its path). Directories are a way of grouping files together so that they're easier to find if you know properties of the file but not its actual name.
Traditional file systems are not good at finding files if you know very little about them; however, they are robust enough that one can add a layer on top of them to aid in retrieving files based on content or meta-information such as tags. That's what indexers are for.
The bottom line is we need a way to store persistently the bytes that the CPU needs to execute. So we have traditional file systems which are very good at organizing sequential sets of bytes. We also need to store persistently the bytes of files that aren't executed directly, but are used by things that do execute. Why create a new system for the same fundamental thing?
What more should a file system do other than store and retrieve bytes?
I'll echo the other responses. If I could pick a filesystem type, I personally would rather see a hybrid approach: a flat database of subtrees, where each subtree is treated as a cohesive unit; the subtrees themselves would have no hierarchy among them, but would instead carry metadata and be queryable on that metadata.
The reason for files is that humans like to attach names to "things" they have to use. Otherwise, it becomes hard to talk or think about or even distinguish them.
When we have too many things on a heap, we like to separate the heap. We sort it by some means and we like to build hierarchies where you can navigate arbitrarily sized amounts of things.
Hence directories and files just map our natural way of working with real objects, since you can put anything in a file. On Unix, even hardware is mapped into the file system as "device nodes": special files which you can read from or write to in order to send commands to the hardware.
I think the metaphor is so powerful, it will stay.
I spent a while trying to come up with an automagically versioning file system that would maintain versions (and version history) of any specific file and/or directory structure.
The idea was that all of the standard access commands (e.g. dir, read, etc.) would have an optional date/time parameter that could be passed to access the file system as it looked at that point in time.
I got pretty far with it, but had to abandon it when I had to actually go out and earn some money. It's been on the back-burner since then.
If you take a look at the start-up times for operating systems, it should be clear that improvements in accessing disks can be made. I'm not sure if the changes should be in the file system or rather in the OS start-up code.
Personally, I'm really sorry WinFS didn't fly. I loved the concept...
From Wikipedia (http://en.wikipedia.org/wiki/WinFS):

WinFS includes a relational database for storage of information, and allows any type of information to be stored in it, provided there is a well defined schema for the type. Individual data items could then be related together by relationships, which are either inferred by the system based on certain attributes or explicitly stated by the user. As the data has a well defined schema, any application can reuse the data; and using the relationships, related data can be effectively organized as well as retrieved. Because the system knows the structure and intent of the information, it can be used to make complex queries that enable advanced searching through the data and aggregating various data items by exploiting the relationships between them.
A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using FUSE, with a database back-end (a minimal skeleton is sketched below).
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
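To make the FUSE route concrete, the skeleton below is roughly the starting point for a database-backed filesystem: FUSE calls back into your getattr/readdir/read handlers, and each stub is where a database query would go. This is only a minimal sketch against the FUSE 2.x C API; the single hard-coded file stands in for rows in whatever database you choose.

// Minimal FUSE 2.x skeleton for a database-backed filesystem. Every callback
// is where a DB query would go; one hard-coded "file" stands in for the DB.
// Build on Linux with: g++ dbfs.cpp `pkg-config fuse --cflags --libs`
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <algorithm>
#include <cerrno>
#include <cstring>
#include <string>

static const std::string kPath = "/hello";     // would come from the database
static const std::string kData = "hello from the database\n";

static int fs_getattr(const char* path, struct stat* st)
{
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; return 0;
    }
    if (path == kPath) {                        // a DB lookup would go here
        st->st_mode = S_IFREG | 0444; st->st_nlink = 1;
        st->st_size = kData.size(); return 0;
    }
    return -ENOENT;
}

static int fs_readdir(const char* path, void* buf, fuse_fill_dir_t filler,
                      off_t, struct fuse_file_info*)
{
    if (std::strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", nullptr, 0);
    filler(buf, "..", nullptr, 0);
    filler(buf, kPath.c_str() + 1, nullptr, 0); // e.g. SELECT name FROM files
    return 0;
}

static int fs_read(const char* path, char* buf, size_t size, off_t off,
                   struct fuse_file_info*)
{
    if (path != kPath) return -ENOENT;
    if (off >= (off_t)kData.size()) return 0;
    size_t n = std::min(size, kData.size() - (size_t)off);
    std::memcpy(buf, kData.data() + off, n);    // e.g. SELECT data FROM files
    return (int)n;
}

int main(int argc, char* argv[])
{
    fuse_operations ops = {};
    ops.getattr = fs_getattr;
    ops.readdir = fs_readdir;
    ops.read    = fs_read;
    return fuse_main(argc, argv, &ops, nullptr);
}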
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems as well as filesystems in which you can search by SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can have a really deep look inside, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious: look at mainframes and minis, especially iSeries OS (now called IBM i; it used to be called iOS or OS/400).
Using a relational database as a mass data store is relatively easy; Oracle and MySQL can both do it. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database. A file with some catalogue information.
3) Large data in BLOBs with extensive metadata and complex structures in the database. A file with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> relational map with extensive metadata. While there may be an exportable form, the application naturally works with the database, and the notion of the file as the repository is lost.
I have an experiment streaming 1Mb/s of numeric data, which needs to be stored for later processing.
It seems as easy to write directly into a database as to a CSV file and I would then have the ability to easily retrieve subsets or ranges.
I have experience of sqlite2 (when it only had text fields) and it seemed pretty much as fast as raw disk access.
Any opinions on the best current in-process DBMS for this application?
Sorry - I should have added that this is C++, initially on Windows, but cross-platform is nice. Ideally the DB binary file format should be cross-platform.
If you only need to read/write the data, without any checking or manipulation done in the database, then both should do it fine. Firebird's database file can be copied, as long as the systems have the same endianness (i.e. you cannot copy the file between systems with Intel and PPC processors, but Intel-Intel is fine).
However, if you need to ever do anything with data, which is beyond simple read/write, then go with Firebird, as it is a full SQL server with all the 'enterprise' features like triggers, views, stored procedures, temporary tables, etc.
BTW, if you decide to give Firebird a try, I highly recommend you use the IBPP library to access it. It is a very thin C++ wrapper around Firebird's C API. It has about 10 classes that encapsulate everything, and it's dead easy to use.
If all you want to do is store the numbers and be able to do range queries easily, you can just take any standard tree data structure available in the STL and serialize it to disk. This may bite you in a cross-platform environment, especially if you are trying to go cross-architecture.
As far as more flexible/people-friendly solutions go, sqlite3 is widely used, solid, stable, and very nice all around.
BerkeleyDB has a number of good features for which one would use it, but none of them apply in this scenario, imho.
I'd say go with sqlite3 if you can accept the license agreement.
-D
Depends what language you are using. If it's C/C++, Tcl, or PHP, SQLite is still among the best in the single-writer scenario. If you don't need SQL access, a Berkeley DB-style library might be slightly faster, like Sleepycat or gdbm. With multiple writers you could consider a separate client/server solution, but it doesn't sound like you need it. If you're using Java, hsqldb or Derby (shipped with Sun's JVM under the "JavaDB" branding) seem to be the solutions of choice.
You may also want to consider a numeric data file format that is specifically geared towards storing these types of large data sets. For example:
HDF -- the most common and well supported in many languages with free libraries. I highly recommend this.
CDF -- a similar format used by NASA (but useable by anyone).
NetCDF -- another similar format (the latest version is actually a stripped-down HDF5).
This link has some info about the differences between the above data set types:
http://nssdc.gsfc.nasa.gov/cdf/html/FAQ.html
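If you go the HDF route, writing a block of numeric samples looks roughly like the sketch below, using the HDF5 C API (the dataset name and sizes are arbitrary; for a continuous stream you would make the dataset chunked and extendable, but the shape of the API stays the same):

// Minimal sketch: store a buffer of numeric samples in an HDF5 file.
// Dataset name and dimensions are arbitrary; error checking omitted.
#include <hdf5.h>
#include <vector>

int main()
{
    std::vector<double> samples(1024, 0.0);  // one buffer of experiment data

    hid_t file  = H5Fcreate("experiment.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hsize_t dims[1] = { samples.size() };
    hid_t space = H5Screate_simple(1, dims, nullptr);
    hid_t dset  = H5Dcreate2(file, "/samples", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
             samples.data());

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}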
I suspect that neither database will allow you to write data at such high speed. You can check this yourself to be sure. In my experience, SQLite failed to INSERT more than 1000 rows per second for a very simple table with a single integer primary key.
In case of a performance problem, I would write the files in CSV format, and later load their data into the database (SQLite or Firebird) for further processing.
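For what it's worth, figures around 1000 rows per second are typical when every INSERT runs in its own implicit transaction, because each commit forces a sync to disk; batching many rows per explicit transaction with a prepared statement usually raises that dramatically. A minimal sketch with SQLite's C API (the table name, values, and batch size are arbitrary):

// Sketch: batch inserts inside explicit transactions with a prepared
// statement. Table name and batch size are arbitrary; error checks trimmed.
#include <sqlite3.h>

int main()
{
    sqlite3* db = nullptr;
    sqlite3_open("samples.db", &db);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS samples(t REAL, value REAL);",
                 nullptr, nullptr, nullptr);

    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO samples(t, value) VALUES(?, ?);",
                       -1, &ins, nullptr);

    sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
    for (int i = 0; i < 100000; ++i) {           // one burst of incoming data
        sqlite3_bind_double(ins, 1, i * 0.001);  // placeholder timestamp
        sqlite3_bind_double(ins, 2, 42.0);       // placeholder measurement
        sqlite3_step(ins);
        sqlite3_reset(ins);
        if (i % 10000 == 9999)                   // commit every 10,000 rows
            sqlite3_exec(db, "COMMIT; BEGIN;", nullptr, nullptr, nullptr);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

    sqlite3_finalize(ins);
    sqlite3_close(db);
    return 0;
}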