Methods for storing metadata associated with individual files?

Given a collection of files which will have associated metadata, what are the recommended methods for storing this metadata?
Some files formats support storing metadata internally (EXIF,ID3,etc), but not all file formats support this, so what are more general options?
Some of the metadata would almost certainly be unique (titles/descriptions/etc), whilst some would be repetitive to varying degrees (categories/tags/etc).
It may also be useful to group the metadata, if different types of attribute are required.
Ideally, solutions should cover concepts, rather than specific language implementations.

Storing metadata in a database has some advantages, but the main problem is that the metadata is not directly connected to your data. It is more robust if the metadata stays with the data, for example as a special file in the same directory.
Some filesystems offer special functionality that can be used for metadata, such as NTFS alternate data streams. Unfortunately, this is suitable for metadata storage only in special cases, because those streams are easily lost when data is copied to a storage system that does not support them. I believe Linux filesystems have a similar mechanism (extended attributes).
Anyway, the most common solutions are:
separate hidden file(s) (per directory) that hold the metadata
a special hidden directory holding the metadata (as Subversion, CVS, etc. do)
a database (of various kinds) for all application-specific metadata; in most cases this database can also be used for caching
IMO there is no general-purpose solution. I would store the metadata in a hidden file (for robustness) and use a database for fast access and caching.
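As a rough illustration of the hidden-file approach, here is a minimal Python sketch that keeps one JSON sidecar file per directory. The file name `.metadata.json`, the JSON layout and the function names are all assumptions made for this example, not an established convention.

```python
import json
from pathlib import Path

# Hypothetical name for the per-directory hidden metadata file.
SIDECAR_NAME = ".metadata.json"


def load_sidecar(directory: Path) -> dict:
    """Return the metadata mapping for a directory, or {} if there is none yet."""
    sidecar = directory / SIDECAR_NAME
    if not sidecar.exists():
        return {}
    return json.loads(sidecar.read_text(encoding="utf-8"))


def set_metadata(directory: Path, filename: str, **fields) -> None:
    """Merge the given fields into one file's entry and rewrite the sidecar."""
    data = load_sidecar(directory)
    data.setdefault(filename, {}).update(fields)
    text = json.dumps(data, indent=2, sort_keys=True)
    (directory / SIDECAR_NAME).write_text(text, encoding="utf-8")


def get_metadata(directory: Path, filename: str) -> dict:
    """Return the stored fields for one file, or {} if nothing is recorded."""
    return load_sidecar(directory).get(filename, {})
```

Because the sidecar travels with the directory, a database built on top of it can be treated as a disposable cache that is rebuilt by rescanning the sidecar files.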

One option might be a relational database, structured like this:
FILE
    f_id
    f_location
    f_title
    f_description
ATTRIBUTE
    a_id
    a_label
VALUE
    v_id
    v_label
METADATA
    md_file
    md_attribute
    md_value
This implementation has some unique information (title/description),
but is primarily targeted at repetitive groups of data.
For some requirements, other less generic tables may be more useful.
The advantages are that relational databases are very common,
and obviously very good at handling relationships and storing lots of data.
However, for some uses a database server brings an overhead which might not be desirable.
Also, the database server is distinct from the files - they do not sit together, and require different methods of interaction.
Databases do not (easily) sit under version control - which may be a good or bad thing, depending on your point of view and specific needs.
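To make the structure above concrete, here is a small sketch using Python's built-in sqlite3 module (so no separate server is involved). The table and column names follow the outline in this answer; the constraints, the helper functions and the sample values are assumptions added for illustration.

```python
import sqlite3

# Column names follow the outline above; the constraints are assumptions.
SCHEMA = """
CREATE TABLE file (
    f_id INTEGER PRIMARY KEY,
    f_location TEXT NOT NULL UNIQUE,
    f_title TEXT,
    f_description TEXT
);
CREATE TABLE attribute (a_id INTEGER PRIMARY KEY, a_label TEXT NOT NULL UNIQUE);
CREATE TABLE "value" (v_id INTEGER PRIMARY KEY, v_label TEXT NOT NULL UNIQUE);
CREATE TABLE metadata (
    md_file INTEGER NOT NULL REFERENCES file(f_id),
    md_attribute INTEGER NOT NULL REFERENCES attribute(a_id),
    md_value INTEGER NOT NULL REFERENCES "value"(v_id),
    PRIMARY KEY (md_file, md_attribute, md_value)
);
"""


def open_db(path=":memory:"):
    """Create the four tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def tag_file(conn, location, attribute, value):
    """Attach attribute=value to a file, creating any missing rows."""
    conn.execute("INSERT OR IGNORE INTO file (f_location) VALUES (?)", (location,))
    conn.execute("INSERT OR IGNORE INTO attribute (a_label) VALUES (?)", (attribute,))
    conn.execute('INSERT OR IGNORE INTO "value" (v_label) VALUES (?)', (value,))
    conn.execute(
        """
        INSERT OR IGNORE INTO metadata (md_file, md_attribute, md_value)
        SELECT f.f_id, a.a_id, v.v_id
        FROM file f, attribute a, "value" v
        WHERE f.f_location = ? AND a.a_label = ? AND v.v_label = ?
        """,
        (location, attribute, value),
    )


def files_with(conn, attribute, value):
    """Return the locations of all files carrying attribute=value, sorted."""
    rows = conn.execute(
        """
        SELECT f.f_location
        FROM metadata m
        JOIN file f ON f.f_id = m.md_file
        JOIN attribute a ON a.a_id = m.md_attribute
        JOIN "value" v ON v.v_id = m.md_value
        WHERE a.a_label = ? AND v.v_label = ?
        ORDER BY f.f_location
        """,
        (attribute, value),
    )
    return [row[0] for row in rows]
```

Repeated labels such as a category name are stored once and referenced by id, which is where this layout pays off for the repetitive metadata.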

I think the "solution" depends greatly upon what you're going to be doing with the metadata.
For example, almost all of the metadata we store (for multiple datasets of scientific data) is chopped up and stored in a database. This allows us to create datasets that preserve the metadata common to the files (as you say, categories and tags), while we keep file-specific structures (title, start/stop time, min/max values, etc.). While we could keep these in hidden files, we do a lot of searching and open our interface to outside consumers via web services.
If you're storing metadata that isn't going to be searched on, hidden files or a dedicated .xml file per "real" file isn't a bad route to take. It's readable by basically anything, can be converted to different formats easily, and won't be lost if you decide to change your storage mechanism.
Metadata should help you, not hinder you. I've seen (and been a part of) systems where metadata storage has become more burdensome than storing the actual data, and became a liability. Just keep in mind what you are trying to do with it, and don't over extend yourself with "what ifs."

Plain text has some obvious advantages over anything else. Something like
FileName = 'ferrari.gif'
Title = 'My brand new car'
Tags = 'cars', 'cool'
Related = 'michaelknight.mp3'
Picasa's Picasa.ini files are a good example for this kind of metadata. Also, instead of inventing your own format, XML might be worth considering. There are plenty of readily available DOM processors to deal with this format.
Then again, if the amount of files and relations between them is huge, databases may be better.
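For the plain-text route, an INI-style file can be read with a standard parser instead of a hand-written one. The sketch below uses Python's configparser and assumes one section per file with unquoted values and comma-separated tags; that layout is an assumption for this example and differs slightly from the quoted snippet above.

```python
import configparser

# Assumed layout: one section per file, tags separated by commas.
SAMPLE = """
[ferrari.gif]
Title = My brand new car
Tags = cars, cool
Related = michaelknight.mp3
"""


def parse_metadata(text):
    """Parse INI-style metadata into {filename: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of keys such as "Title"
    parser.read_string(text)
    result = {}
    for filename in parser.sections():
        entry = dict(parser[filename])
        if "Tags" in entry:
            # Turn the comma-separated tag string into a list.
            entry["Tags"] = [t.strip() for t in entry["Tags"].split(",") if t.strip()]
        result[filename] = entry
    return result
```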

I would basically make a metadata DB which held this information:
RESOURCE_TABLE
    RESOURCE_ID
    RESOURCE_TYPE (folder, doctype, web link, other)
    RESOURCE_URL (any URL)
NOTES_TABLE
    NOTE_ID
    RESOURCE_NO
    RESOURCE_NOTE (long text)
TAGS_TABLE
    TAG_ID
    RESOURCE_NO
    TAG_TEXT
Then I would use the note field for textual notes on the file/folder/resource. Choose whether you want 1:1 or 1:N for this.
The tags field I would use to store any number of searchable parameters like YEAR, PROJECT, and other values that will describe and group your content.
Then you could add tables for owner, stakeholders, and other organisation info etc.

Related

Should I store uploaded filename in database?

I have a database table with an autoincrement ID as primary key.
For each record of this table, I can have up to 3 files, which can be publicly available so random filename generation is not mandatory, and these files are optional.
I think I have 2 possible solutions:
Store a randomly generated filename in each of 3 nullable varchar columns and store all the files in the same place:
columns: a | b | c
uploads/f6se54fse654.jpg
Don't store the filenames, but place the files in specific folders and name them after the primary key value:
uploads/a/1.jpg
uploads/b/1.jpg
uploads/c/1.jpg
With this last solution, I know that uploads/a/1.jpg belongs to record with ID 1, and is a file of type a. But I have to check if the file exists because the files are optional.
Do you think there is a good practice in all that? Or maybe there is a better approach?
If the files you are talking about are intended to be displayed or downloaded by users (whether visitors or authenticated users, filtered by roles (ACL) or not), it is important (IMHO) to ensure that a user cannot infer anything beyond the content of the resource that was sent to them. There is no perfect solution that applies to every case, so let's take an example.
To keep sensitive data opaque, take the specific case of uploads/users/7/invoices/3.pdf: I think it would be wise to ensure that no one can guess how many files are associated with the user or any other entity (otherwise, in this example, one could assume that other files, 1.pdf and 2.pdf, are probably accessible too). By design, we generally want to give access to files only in well-defined cases and contexts. However, this may not matter for an image file that is intended to be seen by everyone (a profile photo, for example). That's why the context matters.
If you keep the auto-incremented identifiers as file names, this can also reveal how much data is stored in your database (/uploads/invoices/128.pdf suggests that you may already have 127 invoices on your server) and potentially motivate unscrupulous people to try to reach resources that should never be fetched outside the defined context. This is much harder to guess if you use generated unique identifiers (GUIDs) instead.
I recommend that you read this article concerning the generation of (G)/(U)UIDs (128-bit numbers, usually written in hexadecimal) to be stored in your database for each uploaded or created file. If you use a recent version of MySQL, you can even store this identifier in a BINARY(16) column and convert it to and from the UUID text form with functions such as UUID_TO_BIN() and BIN_TO_UUID(); see this interesting topic on the subject. The result would look like /uploads/invoices/b0016303-8e4f-487a-8c30-5dddf1ebf7e9.pdf, which is a lot better as long as you ensure that the generated identifier is unique.
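A minimal sketch of that idea in Python, using the standard uuid module: a random UUID is generated per upload, its 16 raw bytes are what a BINARY(16) column would hold, and its text form becomes the file name. The function name and the base directory are assumptions for illustration.

```python
import uuid
from pathlib import PurePosixPath


def storage_path(original_name, base="uploads/invoices"):
    """Return (uuid, path) for a new upload, keeping only the file extension."""
    suffix = PurePosixPath(original_name).suffix.lower()
    file_id = uuid.uuid4()  # random 128-bit identifier
    return file_id, f"{base}/{file_id}{suffix}"


file_id, path = storage_path("Invoice March.PDF")
db_value = file_id.bytes  # 16 bytes, suitable for a BINARY(16) column
```

The original file name is deliberately not part of the path; if it is needed for downloads it can be stored in its own column next to the identifier.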
It does not seem useful to discuss performance here, because today there are many ways to cache files, paths and URLs, which in many cases avoid a query each time a resource is requested (caches are often ordered by popularity in big-data cases).
Last but not least, many web and mobile applications (Slack, Discord, Facebook, Twitter...) that store large numbers of media files every day, often associated with user accounts and including both public and confidential files, generate a unique identifier for each of them.
Twitter uses its own unique identifier generator (64-bit integers, stored as BIGINT) called Twitter Snowflake, which you might be interested in reading about too. It is based on a millisecond timestamp, combined with a worker ID and a per-millisecond sequence number so that IDs generated within the same millisecond remain distinct.
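To show why a timestamp alone is not enough, here is a sketch of a Snowflake-style generator in Python. It is not Twitter's implementation: the bit widths (41-bit timestamp, 10-bit worker ID, 12-bit sequence) follow the commonly described layout, and the default epoch, the class name and the `clock_ms` parameter are assumptions for this example.

```python
import threading
import time

WORKER_BITS = 10    # assumed width of the worker/machine ID
SEQUENCE_BITS = 12  # assumed width of the per-millisecond counter


class SnowflakeLikeGenerator:
    """Illustrative 64-bit ID generator: timestamp | worker ID | sequence."""

    def __init__(self, worker_id, epoch_ms=1288834974657, clock_ms=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms  # custom epoch, an assumption for this sketch
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence to keep IDs distinct.
                self._sequence = (self._sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            elapsed = now - self._epoch_ms
            return (
                (elapsed << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Unlike a random UUID, such IDs sort roughly by creation time, so they still leak ordering information; whether that matters depends on the context discussed above.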
There isn't a single perfect solution that applies to everything, but I hope this helps; you may want to take a deeper look and find the "best solution" for each context and entity for which you store and link files.

Key/Value database for storing binary data

I am looking for a lightweight, reliable and fast key/value database for storing binary data. Simple, without a server. Most popular key/value databases, like CDB and BerkeleyDB, do not natively store BLOBs. What is the best choice that I have missed?
My current choice is SQLite, but it is too advanced for my simple usage.
As it was previously pointed out, BerkeleyDB does support opaque values and keys, but I would suggest a better alternative: LevelDB.
LevelDB:
Google is your friend :), so much so that they even provide you with an embedded database: A fast and lightweight key/value database library by Google.
Features:
Keys and values are arbitrary byte arrays.
Data is stored sorted by key.
Callers can provide a custom comparison function to override the sort order.
The basic operations are Put(key,value), Get(key), Delete(key).
Multiple changes can be made in one atomic batch.
Users can create a transient snapshot to get a consistent view of data.
Forward and backward iteration is supported over the data.
Data is automatically compressed using the Snappy compression library.
External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
Detailed documentation about how to use the library is included with the source code.
What makes you think BerkDB cannot store binary data? From their docs:
Key and content arguments are objects described by the datum typedef. A datum specifies a string of dsize bytes pointed to by dptr. Arbitrary binary data, as well as normal text strings, are allowed.
Also see their examples:
money = 122.45;
key.data = &money;
key.size = sizeof(float);
...
ret = my_database->put(my_database, NULL, &key, &data, DB_NOOVERWRITE);
If you don't need "multiple writer processes" (only multiple readers work), want something small and want something that is available on nearly every Linux, you might want to take a look at gdbm, which is like Berkeley DB, but much simpler. It's also possibly not as fast.
In nearly the same area are things like tokyocabinet, qdbm, and the already mentioned leveldb.
Berkeley DB and SQLite are ahead of those, because they support multiple writers. Berkeley DB is sometimes a versioning disaster.
The major pro of gdbm: It's already on every linux, no versioning issues, small.
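For a feel of how small the dbm-style interface is, here is a sketch using Python's standard dbm module, which picks whichever backend is available on the system (gdbm where installed, otherwise another implementation). The helper names are assumptions, and some backends limit the size of individual values, so this is a sketch rather than a recommendation for large blobs.

```python
import dbm


def put_blob(path, key: bytes, data: bytes) -> None:
    """Store raw bytes under a key, creating the database file if needed."""
    with dbm.open(path, "c") as db:
        db[key] = data


def get_blob(path, key: bytes):
    """Return the stored bytes, or None if the key is absent."""
    with dbm.open(path, "r") as db:
        try:
            return db[key]
        except KeyError:
            return None
```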
Which OS are you running? For Windows you might want to check out ESE, which is a very robust storage engine which ships as part of the OS. It powers Active Directory, Exchange, DNS and a few other Microsoft server products, and is used by many 3rd party products (RavenDB comes to mind as a good example).
You could also have a look at RaptorDB: http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2
Using SQLite is now straightforward with the functions readfile(X) and writefile(X,Y), which have been available in the sqlite3 command-line shell since 2014-06. For example, to save a blob in the database from a file and extract it again:
CREATE TABLE images(name TEXT, type TEXT, img BLOB);
INSERT INTO images(name,type,img) VALUES('icon','jpeg',readfile('icon.jpg'));
SELECT writefile('icon-copy2.jpg',img) FROM images WHERE name='icon';
See https://www.sqlite.org/cli.html, section "File I/O Functions".
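Outside the command-line shell, the same idea works from application code with a single two-column table. The sketch below wraps Python's built-in sqlite3 module as a tiny key/value store for binary data; the class name, table name and column names are assumptions for this example.

```python
import sqlite3


class BlobStore:
    """Minimal key/value store for bytes on top of one SQLite table."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS blobs (name BLOB PRIMARY KEY, data BLOB NOT NULL)"
        )

    def put(self, key: bytes, value: bytes) -> None:
        # "with conn" commits on success and rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)", (key, value)
            )

    def get(self, key: bytes):
        row = self._conn.execute(
            "SELECT data FROM blobs WHERE name = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def delete(self, key: bytes) -> None:
        with self._conn:
            self._conn.execute("DELETE FROM blobs WHERE name = ?", (key,))
```

Used this way, none of SQLite's more advanced features get in the way: the application only ever sees put, get and delete.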

How do I index different sources in Solr?

How do I index text files, web sites and database in the same Solr schema? All 3 sources are a requirement and I'm trying to figure out how to do it. I did some examples and they're working fine as they're separate from each other, now I need them all to be 1 schema since the user will be searching in all of those 3 data sources.
How should I proceed?
You should sketch up a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively, is using the DisMax/EDisMax request handlers, to facilitate searching in several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value, alongside highly weighted fields (such as title, headings, keywords, or filename). There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
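As a sketch of what such a request might look like, the Python snippet below builds a select URL that uses the eDisMax parser (`defType=edismax`) with per-field boosts in `qf`. The core URL and the field names (`title`, `filename`, `text_all` as the catch-all copy-field) are assumptions; they have to match whatever your schema defines.

```python
from urllib.parse import urlencode


def build_search_url(base_url, user_query, field_boosts, rows=10):
    """Build a Solr select URL that searches several boosted fields at once."""
    # qf takes space-separated "field^boost" entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": qf,
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and fields: high boost on title, low boost on the catch-all.
url = build_search_url(
    "http://localhost:8983/solr/docs",
    "annual report",
    {"title": 5, "filename": 3, "text_all": 0.5},
)
```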

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20Mb per TU.
Reading the following SO answer made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical = a unique path describes each one's location (in filesystems w/o hard links at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (or anywhere else as a string; you could have lookup tables galore if you wanted.)
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.

How do you structure config data in a database?

What is people's preferred method of storing application configuration data in a database? Having done this in the past myself, I've used two ways of doing it.
You can create a table where you store key/value pairs, where key is the name of the config option and value is its value. The pros are that adding new values is easy and you can use the same routines to set/get data. The downside is that the values are untyped.
Alternatively, you can hardcode a configuration table, with each column being the name of the value and its datatype. The downside to this is more maintenance when setting up new values, but it allows you to have typed data.
Having used both, my preference lies with the first option as it's quicker to set things up; however, it's also riskier and can reduce performance (slightly) when looking up data. Does anyone have any alternative methods?
Update
It's necessary to store the information in a database because as noted below, there may be multiple instances of the program that require configuring the same way, as well as stored procedures potentially using the same values.
You can expand option 1 with a third column giving a data type. Your application can then use this data-type column to cast the value.
But yeah, I would go with option 1, if config files are not an option. Another advantage of option 1 is you can read it into a Dictionary object (or equivalent) for use in your application really easily.
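Here is a small sketch of that three-column variant in Python with sqlite3: values are stored as text, a `data_type` column records how to cast them, and the whole table is read into a dictionary. The table name, column names and the set of supported types are assumptions for this example.

```python
import sqlite3

# Maps the stored data_type label to a function that casts the text value.
CASTERS = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda text: text.strip().lower() in ("1", "true", "yes"),
}


def open_config(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config ("
        "name TEXT PRIMARY KEY, value TEXT NOT NULL, data_type TEXT NOT NULL DEFAULT 'str')"
    )
    return conn


def set_option(conn, name, value):
    """Store a value as text together with the name of its Python type."""
    data_type = type(value).__name__
    if data_type not in CASTERS:
        raise TypeError(f"unsupported config type: {data_type}")
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO config (name, value, data_type) VALUES (?, ?, ?)",
            (name, str(value), data_type),
        )


def load_config(conn):
    """Read every option into a dictionary, casting by the data_type column."""
    return {
        name: CASTERS[data_type](value)
        for name, value, data_type in conn.execute(
            "SELECT name, value, data_type FROM config"
        )
    }
```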
Since configuration typically can be stored in a text file, the string data type should be more than enough to store the configuration values. If you're using a managed language, it's the code that knows what the data type should be, not the database.
More importantly, consider these things with configuration:
Hierarchy: Obviously, configuration will benefit from a hierarchy.
Versioning: Consider the benefit of being able to roll back to the configuration that was in effect at a certain date.
Distribution: Some time, it might be nice to be able to cluster an application. Some properties should probably be local to each node in a cluster.
Documentation: Depending on if you have a web tool or something, it is probably nice to store the documentation about a property close to the code that uses it. (Code annotations is very nice for this.)
Notification: How is the code going to know that a change has been made somewhere in the configuration repository?
Personally, I like an inverted way of handling configuration, where the configuration properties are injected into modules that don't know where the values came from. This way, the configuration management system can be very complex or very simple depending on your (current) needs.
I use option 1.
My project uses a database table with four columns:
ID [pk]
Scope (default 'Application')
Setting
Value
Settings with a Scope of 'Application' are global settings, such as Maximum number of simultaneous users.
Each module has its own scope, so our ResultsLoader and UserLoader have different scopes, but both have a Setting named 'inputPath'.
Defaults are either provided in the source code or are injected via our IoC container. If no value is injected or provided in the database, the default from the code is used (if one exists). Therefore, defaults are never stored in the database.
This works out quite well for us. Each time we backup the database we get a copy of the Configuration which is quite handy. The two are always in sync.
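The lookup described above can be sketched in a few lines of Python with sqlite3: the database is consulted first, and a default defined in code is used only when no row exists for the scope and setting. The table layout mirrors the four columns listed; the default values and function names are assumptions for illustration.

```python
import sqlite3

# Defaults live in code and are never written to the database (hypothetical values).
DEFAULTS = {
    ("Application", "maxUsers"): "50",
    ("ResultsLoader", "inputPath"): "/data/results",
}


def open_settings(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings ("
        "id INTEGER PRIMARY KEY, "
        "scope TEXT NOT NULL DEFAULT 'Application', "
        "setting TEXT NOT NULL, "
        "value TEXT, "
        "UNIQUE (scope, setting))"
    )
    return conn


def get_setting(conn, scope, setting, defaults=DEFAULTS):
    """Return the database value if present, else the code default (or None)."""
    row = conn.execute(
        "SELECT value FROM settings WHERE scope = ? AND setting = ?",
        (scope, setting),
    ).fetchone()
    if row is not None:
        return row[0]
    return defaults.get((scope, setting))
```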
It seems overkill to use the DB for config data.
EDIT (sorry too long for comment box):
Of course there are no strict rules on how you implement any part of your program. For the sake of argument, slotted screwdrivers work on some Phillips screws! I guess I judged too early, before knowing what your scenario is.
Relational databases excel as massive data stores that give you quick storing, updating, and retrieval, so if your config data is updated and read constantly, then by all means use a DB.
Another scenario where a DB may make sense is a server farm where you want your database to store your central config, but then you can do the same with a shared network drive that holds the XML config file.
An XML file is better when your config is hierarchically structured. You can easily organize, locate, and update what you need, and as a bonus you can version control the config file along with your source code!
All in all, it all depends on how the config data is used.
That concludes my opinion with limited knowledge of your application. I am sure you can make the right decision.
I guess this is more of a poll, so I'll say the column approach (option 2). However it will depend on how often your config changes, how dynamic it is, and how much data there is, etc.
I'd certainly use this approach for user configurations / preferences, etc.
Go with option 2.
Option 1 is really a way of implementing a database on top of a database, and that is a well-known antipattern, which is just going to give you trouble in the long run.
I can think of at least two more ways:
(a) Create a table with key, string-value, date-value, int-value, real-value columns. Leave unused types NULL.
(b) Use a serialization format like XML, YAML or JSON and store it all in a blob.
Where do you store the configuration settings your app needs to connect to the database?
Why not store the other config info there too?
I'd go with option 1, unless the number of config options were VERY small (seven or less)
At my company, we're working on using option one (a simple dictionary-like table) with a twist. We're allowing for string substitution using tokens which contain the name of the config variable to be substituted.
For example, the table might contain rows ('database connection string', 'jdbc://%host%...') and ('host', 'foobar'). Encapsulating that with a simple service or stored procedure layer allows for an extremely simple, but flexible, recursive configuration. It supports our need to have multiple isolated environments (dev, test, prod, etc).
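The substitution itself is easy to sketch. The Python below resolves `%name%` tokens recursively against a dictionary standing in for the table rows, and refuses circular references. The sample rows are hypothetical (the connection string in the example above is elided and is not reproduced here), and a literal `%...%` sequence inside a value would be treated as a token.

```python
import re

# A token is any non-empty run of characters between two percent signs.
TOKEN = re.compile(r"%([^%]+)%")


def resolve(config, key, _seen=()):
    """Return config[key] with every %name% token replaced, recursively."""
    if key in _seen:
        raise ValueError(f"circular reference involving {key!r}")
    raw = config[key]  # raises KeyError for an unknown key or token
    return TOKEN.sub(lambda m: resolve(config, m.group(1), _seen + (key,)), raw)


# Hypothetical rows for one environment.
rows = {
    "database connection string": "jdbc://%host%:%port%/app",
    "host": "%hostname%.example",
    "hostname": "foobar",
    "port": "5432",
}
```

Swapping only the `hostname` row per environment (dev, test, prod) then changes every value that depends on it.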
I've used both 1 and 2 in the past, and I think they're both terrible solutions. I think Option 2 is better because it allows typing, but it's a lot more ugly than option 1. The biggest problem I have with either is versioning the config file. You can version SQL reasonably well using standard version control systems, but merging changes is usually problematic. Given an opportunity to do this "right", I'd probably create a bunch of tables, one for each type of configuration parameter (not necessarily for each parameter itself), thus getting the benefit of typing and the benefit of the key/value paradigm where appropriate. You can also implement more advanced structures this way, such as lists and hierarchies, which will then be directly queryable by the app instead of having to load the config and then transform it somehow in memory.
I vote for option 2. Easy to understand and maintain.
Option 1 is good for an easily expandable, central storage location. In addition to some of the great column suggestions by folks like RB, Hugo, and elliott, you might also consider:
Include a Global/User setting flag with a user field or even a user/machine field (for machine-specific UI type settings).
Those can, of course, be stored in a local file, but since you are using the database anyway, that makes these available for aliasing a user when debugging, which can be important if the bug is setting related. It also allows an admin to manage settings when necessary.
I use a mix of option 2 and XML columns in SQL server.
You may also want to add a check constraint to keep the table at one row.
CREATE TABLE [dbo].[MyOption] (
[GUID] uniqueidentifier CONSTRAINT [dfMyOptions_GUID] DEFAULT newsequentialid() ROWGUIDCOL NOT NULL,
[Logo] varbinary(max) NULL,
[X] char(1) CONSTRAINT [dfMyOptions_X] DEFAULT 'X' NOT NULL,
CONSTRAINT [MyOptions_pk] PRIMARY KEY CLUSTERED ([GUID]),
CONSTRAINT [MyOptions_ck] CHECK ([X]='X')
)
for settings that have no relation to any db tables, i'd probably go for the EAV approach if you need the db to work with the values. otherwise a serialized field value is good if it's really just a store for app code.
but what about a format for a single field to store multiple config settings to be used by the db?
like one field per user that contains all their settings related to their messageboard view (like default sort order, blocked topics, etc.), and maybe another with all their settings for their theme (like text color, bg color, etc.)
Storing hierarchies and documents in a relational DB is madness. Either you have to shred them, only to recombine them at some later stage, or they're bunged inside a BLOB, which is even more stupid.
Don't use a relational DB for non-relational data; the tool does not fit. Consider something like MongoDB or CouchDB for this: schema-less, non-relational data stores. Store it as JSON if it's coming down the wire in any way to a client, use XML for serverside.
CouchDB gives you versioning out of the box.
Don't store configuration data in a database unless you have a very good reason to. If you do have a very good reason, and are absolutely certain you are going to do it, you should probably store it in a data serialization format like JSON or YAML (not XML, unless you actually need a markup language to configure your app -- trust me, you don't) as a string. Then you can just read the string, and use tools in whatever language you work in to read and modify it. Store the strings with timestamps, and you have a simple versioning scheme with the ability to store hierarchical data in a very simple system. Even if you don't need hierarchical config data, at least now if you need it in the future you won't have to change your config interface to get it. Of course you lose the ability to do relational queries on your config data, but if you're storing that much config data, then you're probably doing something very wrong anyway.
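A sketch of that "serialized string plus timestamp" scheme, using Python's json and sqlite3 modules: every save appends a new row, the latest row is the current config, and an `as_of` timestamp retrieves what was in effect earlier. Table, column and function names are assumptions for this example.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_config_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config_versions ("
        "version_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "saved_at TEXT NOT NULL, "
        "body TEXT NOT NULL)"
    )
    return conn


def save_config(conn, config, saved_at=None):
    """Append a new version; earlier versions are never modified."""
    if saved_at is None:
        saved_at = datetime.now(timezone.utc).isoformat(timespec="microseconds")
    with conn:
        conn.execute(
            "INSERT INTO config_versions (saved_at, body) VALUES (?, ?)",
            (saved_at, json.dumps(config, sort_keys=True)),
        )


def load_config(conn, as_of=None):
    """Return the latest config, or the one in effect at the as_of timestamp."""
    if as_of is None:
        row = conn.execute(
            "SELECT body FROM config_versions ORDER BY version_id DESC LIMIT 1"
        ).fetchone()
    else:
        # ISO timestamps in one consistent format compare correctly as text.
        row = conn.execute(
            "SELECT body FROM config_versions WHERE saved_at <= ? "
            "ORDER BY saved_at DESC, version_id DESC LIMIT 1",
            (as_of,),
        ).fetchone()
    return json.loads(row[0]) if row else None
```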
Companies tend to store lots of configuration data for their systems in a database. I'm not sure why; I don't think much thought goes into these decisions. I don't see this kind of thing done too often in the OSS world. Even large OSS programs that need lots of configuration, like Apache, don't need a connection to a database containing an apache_config table to work. Having a huge amount of configuration to deal with in your apps is a bad code smell, and storing that data in a database just causes more problems (as this thread illustrates).
