Key/Value database for storing binary data

I am looking for a lightweight, reliable and fast key/value database for storing binary data: something simple, with no server. Most popular key/value databases, such as CDB and BerkeleyDB, do not natively store BLOBs. What is the best choice that I have missed?
My current choice is SQLite, but it is too advanced for my simple usage.

As was previously pointed out, BerkeleyDB does support opaque values and keys, but I would suggest a better alternative: LevelDB.
LevelDB:
Google is your friend :), so much so that they even provide you with an embedded database: A fast and lightweight key/value database library by Google.
Features:
Keys and values are arbitrary byte arrays.
Data is stored sorted by key.
Callers can provide a custom comparison function to override the sort order.
The basic operations are Put(key,value), Get(key), Delete(key).
Multiple changes can be made in one atomic batch.
Users can create a transient snapshot to get a consistent view of data.
Forward and backward iteration is supported over the data.
Data is automatically compressed using the Snappy compression library.
External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
Detailed documentation about how to use the library is included with the source code.

What makes you think BerkDB cannot store binary data? From their docs:
Key and content arguments are objects described by the datum typedef. A datum specifies a string of dsize bytes pointed to by dptr. Arbitrary binary data, as well as normal text strings, are allowed.
Also see their examples:
money = 122.45;
key.data = &money;
key.size = sizeof(float);
...
ret = my_database->put(my_database, NULL, &key, &data, DB_NOOVERWRITE);

If you don't need "multiple writer processes" (only multiple readers work), want something small, and want something that is available on nearly every Linux system, you might want to take a look at gdbm, which is like Berkeley DB but much simpler. It is possibly not as fast, though.
In nearly the same area are things like Tokyo Cabinet, QDBM, and the already mentioned LevelDB.
Berkeley DB and SQLite are ahead of those because they support multiple writers. Berkeley DB is sometimes a versioning disaster.
The major pros of gdbm: it's already on every Linux system, has no versioning issues, and is small.
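For a quick feel of the dbm family, Python's standard library wraps it in the `dbm` package (with `dbm.gnu` for gdbm where it is installed). The following is only a minimal sketch, assuming Python 3; the file name `blobs.db` and the helper `roundtrip_blob` are made up for illustration, and `dbm.open` picks whichever backend is available rather than gdbm specifically.

```python
import dbm
import os
import tempfile


def roundtrip_blob(path, key, blob):
    """Store raw bytes under key, reopen the file, and read them back."""
    # 'c' opens the database read/write and creates it if it is missing.
    with dbm.open(path, "c") as db:
        db[key] = blob
    # Reopen read-only to show the value survived on disk.
    with dbm.open(path, "r") as db:
        return db[key]


# Hypothetical file name inside a throwaway directory.
demo_dir = tempfile.mkdtemp()
payload = bytes(range(256))  # every byte value, including NUL
restored = roundtrip_blob(os.path.join(demo_dir, "blobs.db"), b"icon", payload)
print(restored == payload)
```

Keys and values are plain byte strings here, so nothing has to be encoded before it is stored.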

Which OS are you running? For Windows you might want to check out ESE, which is a very robust storage engine which ships as part of the OS. It powers Active Directory, Exchange, DNS and a few other Microsoft server products, and is used by many 3rd party products (RavenDB comes to mind as a good example).

You may want to have a look at RaptorDB: http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2

Using SQLite is now straightforward with the functions readfile(X) and writefile(X,Y), which have been available in the sqlite3 command-line shell since 2014-06. For example, to save a blob in the database from a file and extract it again:
CREATE TABLE images(name TEXT, type TEXT, img BLOB);
INSERT INTO images(name,type,img) VALUES('icon','jpeg',readfile('icon.jpg'));
SELECT writefile('icon-copy2.jpg',img) FROM images WHERE name='icon';
see https://www.sqlite.org/cli.html "File I/O Functions"
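Outside the command-line shell, the same idea works by binding bytes to a BLOB column. Below is a minimal sketch of SQLite used as a plain key/value store from Python's standard `sqlite3` module. The table name `kv` and the helper names `open_store`, `put`, `get` and `delete` are assumptions made for this example, not part of SQLite.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite database holding a single key/value table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (key BLOB PRIMARY KEY, value BLOB NOT NULL)"
    )
    return conn


def put(conn, key, value):
    # Using the connection as a context manager commits on success
    # and rolls back if the statement raises.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )


def get(conn, key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]


def delete(conn, key):
    with conn:
        conn.execute("DELETE FROM kv WHERE key = ?", (key,))


store = open_store()  # in-memory for the demo; pass a file path to persist
put(store, b"icon", b"\x89PNG\x00\x01\x02")
put(store, b"icon", b"\xff\xd8\xff\x00")  # a second put overwrites the first
print(get(store, b"icon"))
delete(store, b"icon")
print(get(store, b"icon"))
```

Python `bytes` objects are stored as BLOBs and come back as `bytes`, so the rest of SQLite can simply be ignored if Put/Get/Delete is all that is needed.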

Related

What type of database for storing ML experiments

I'm thinking of writing a small piece of software to run ML experiments on a cluster or an arbitrary abstracted executor and then save them so that I can view them efficiently in real time. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code, or an archive of code, that can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command line arguments. The experiment will also include a YAML schema defining the command line arguments. The code itself will specify what is logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns: first an int iteration, second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When one submits an experiment, one will need to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
For me, this is easiest to implement with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and a folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be a delimited CSV file, with a special schema file describing what types of values are stored there so I can load them. The final parameters can also be copied into the folder.
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. Additionally, maybe I'm overlooking something very obvious. If you have any experience with this, any suggestions or advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those work)?
The data for ML would primarily be unstructured data. That kind of data will not naturally fit into an RDBMS. A document database like MongoDB is far better suited for such cases.

Store static data in an array, or in a database?

We always have some static data which can be stored in a file as an array or stored in a database table in our web based project. So which one should be preferred?
In my opinion, arrays have some advantages:
More flexible (it can be any structure, which specifies a really complex relation)
Better performance (it will be loaded in memory, which will have better read/write performance compared with a database's I/O operations)
But my colleague argued that he prefers the DB approach, since it keeps a uniform data persistence interface and is more flexible.
So which should be preferred? How can we choose? Or should we prefer one in some scenarios and the other in other scenarios? If so, what are those scenarios?
EDIT:
Let me clarify something. Just as Benjamin's change to the title suggests, the data we want to store in an array (file) won't change frequently, which means the code won't change the value of the array at runtime. If the data changed very frequently, I would undoubtedly use a DB. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1" : {
        "name" : "xx",
        "requirement" : {
            "level" : 5,
            "money" : 100,
        }
        ...
    }
}
Just like the above code sample (a Python dict, or you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? Not so good, I think). So in such a case, I would prefer arrays.
So what do you think? In such a scenario, we should prefer arrays to a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data on runtime? Yes: Db, No: File
Do you update your data more than once per week? Yes: Db, No: File
Is it a pain to release an updated data file? Yes: Db, No: File
Do you read that data often? Yes: File/Cache, No: Db
It is a pain to update that data file and you need extra tools? Yes: db, No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexible" array in a file is fraught with a zillion issues already dealt with by using a DB. Unless you can prove that the DB is really going to be way slower than the other approach, use a DB. Move on and start solving business problems.
Edit
Comment from OP asks what the issues with using a file might be, here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing a file back each time a small modification is made isn't going to perform that well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the data that is persisted. IOW, if you don't use a DB you would end up writing one. If you find you need more than one webserver to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: Your code will be responsible for managing the structure of the data, querying it, etc. if you use an array. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity are needed here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure your array/file code would need to ensure that the data is at least not so corrupt that the application cannot continue.
Your colleague is correct, BUT there's where you need to put aside the comp sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequently then don't incur the costs of access overhead. Instead of reading from a flat file you could still gain the advantages of a db, but use a caching strategy in your application. Depending on your development language you could look at something like memcache or jtreecache.
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases scale much more easily than a flat file; if your system starts to grow fast, your flat file might become a burden to your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean and changed easily.
But that's only if you're developing in Java.
Yeah, I agree with your implied assessment that databases are overused and basic flat files may work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts) I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs random inserts/updates) in one thread, I would also use a file. Anything else -- you need a real database with random updates, queries, concurrency control etc.

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20Mb per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical = a unique path describes each one's location (in filesystems w/o hard links at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (or anywhere else as a string; you could have lookup tables galore if you wanted.)
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, a RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.

Methods for storing metadata associated with individual files?

Given a collection of files which will have associated metadata, what are the recommended methods for storing this metadata?
Some files formats support storing metadata internally (EXIF,ID3,etc), but not all file formats support this, so what are more general options?
Some of the metadata would almost certainly be unique (titles/descriptions/etc), whilst some would be repetitive to varying degrees (categories/tags/etc).
It may also be useful to group the metadata, if different types of attribute are required.
Ideally, solutions should cover concepts, rather than specific language implementations.
Storing metadata in a database has some advantages, but the main problem with a database is that the metadata is not directly connected to your data. It is more robust if the metadata stays with the data, for example as a special file in the directory.
Some filesystems offer special functionality that can be used for metadata, like NTFS alternate data streams. Unfortunately, this can be used for metadata storage in special cases only, because those streams can easily be lost when copying data to a storage system that does not support them. I believe that Linux filesystems also have a similar storage mechanism.
Anyway, the most common solutions are:
separate hidden file(s) (per directory) that hold metadata
a special hidden directory with metadata, as some applications use (like Subversion, CVS, etc.)
a database (of various kinds) for all application-specific metadata; in most cases this database can also be used for caching
IMO there is no general-purpose solution. I would choose storing the metadata in a hidden file (robustness) together with a database for fast access and caching.
One option might be a relational database, structured like this:
FILE
    f_id
    f_location
    f_title
    f_description
ATTRIBUTE
    a_id
    a_label
VALUE
    v_id
    v_label
METADATA
    md_file
    md_attribute
    md_value
This implementation has some unique information (title/description),
but is primarily targeted at repetitive groups of data.
For some requirements, other less generic tables may be more useful.
The advantages of this are that relational databases are very common,
and obviously very good at handling relationships and storing lots of data.
However, for some uses a database server brings an overhead which might not be desirable.
Also, the database server is distinct from the files - they do not sit together, and require different methods of interaction.
Databases do not (easily) sit under version control - which may be a good or bad thing, depending on your point of view and specific needs.
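To make the FILE / ATTRIBUTE / VALUE / METADATA layout concrete, here is a minimal sketch using SQLite through Python's `sqlite3` module. The column names come from the outline in this answer; the column types, the constraints, the helper functions `tag_file` and `files_with`, and the sample rows are assumptions added for illustration.

```python
import sqlite3

# Column names follow the outline; types and constraints are assumed.
SCHEMA = """
CREATE TABLE file      (f_id INTEGER PRIMARY KEY, f_location TEXT NOT NULL,
                        f_title TEXT, f_description TEXT);
CREATE TABLE attribute (a_id INTEGER PRIMARY KEY, a_label TEXT UNIQUE NOT NULL);
CREATE TABLE value     (v_id INTEGER PRIMARY KEY, v_label TEXT UNIQUE NOT NULL);
CREATE TABLE metadata  (md_file      INTEGER NOT NULL REFERENCES file(f_id),
                        md_attribute INTEGER NOT NULL REFERENCES attribute(a_id),
                        md_value     INTEGER NOT NULL REFERENCES value(v_id),
                        PRIMARY KEY (md_file, md_attribute, md_value));
"""


def tag_file(conn, f_id, label, value):
    """Attach one attribute/value pair to a file, reusing existing rows."""
    conn.execute("INSERT OR IGNORE INTO attribute (a_label) VALUES (?)", (label,))
    conn.execute("INSERT OR IGNORE INTO value (v_label) VALUES (?)", (value,))
    conn.execute(
        """INSERT OR IGNORE INTO metadata (md_file, md_attribute, md_value)
           SELECT ?, a_id, v_id FROM attribute, value
           WHERE a_label = ? AND v_label = ?""",
        (f_id, label, value),
    )


def files_with(conn, label, value):
    """Return locations of files carrying the given attribute/value pair."""
    rows = conn.execute(
        """SELECT f_location FROM file
           JOIN metadata  ON md_file = f_id
           JOIN attribute ON md_attribute = a_id
           JOIN value     ON md_value = v_id
           WHERE a_label = ? AND v_label = ?
           ORDER BY f_location""",
        (label, value),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample files for the demo.
conn.execute("INSERT INTO file VALUES (1, 'ferrari.gif', 'My brand new car', NULL)")
conn.execute("INSERT INTO file VALUES (2, 'michaelknight.mp3', NULL, NULL)")
tag_file(conn, 1, "tag", "cars")
tag_file(conn, 1, "tag", "cool")
tag_file(conn, 2, "tag", "cool")
print(files_with(conn, "tag", "cool"))
```

Repeated labels such as a shared tag are stored once in `attribute` and `value`, and `metadata` only holds the links, which is what makes the layout suit repetitive data.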
I think the "solution" depends greatly upon what you're going to be doing with the metadata.
For example, almost all of the metadata we store (Multiple datasets of scientific data) are all chopped up and stored in a database. This allows us to create datasets to preserve the common metadata between the files (as you say, categories and tags) while we have file specific structures (title, start/stop time, min/max values etc.) While we could keep these in hidden files, we do a lot of searching and open our interface to outside consumers via web services.
If you're storing metadata that isn't going to be searched on, hidden files or a dedicated .xml file per "real" file isn't a bad route to take. It's readable by basically anything, can be converted to different formats easily, and won't be lost if you decide to change your storage mechanism.
Metadata should help you, not hinder you. I've seen (and been a part of) systems where metadata storage has become more burdensome than storing the actual data, and became a liability. Just keep in mind what you are trying to do with it, and don't over extend yourself with "what ifs."
Plain text has some obvious advantages over anything else. Something like
FileName = 'ferrari.gif'
Title = 'My brand new car'
Tags = 'cars', 'cool'
Related = 'michaelknight.mp3'
Picasa's Picasa.ini files are a good example for this kind of metadata. Also, instead of inventing your own format, XML might be worth considering. There are plenty of readily available DOM processors to deal with this format.
Then again, if the number of files and relations between them is huge, databases may be better.
I would basically make a metadata DB which held this information:
RESOURCE_TABLE
    RESOURCE_ID
    RESOURCE_TYPE (folder, doctype, web link, other)
    RESOURCE_URL (any URL)
NOTES_TABLE
    NOTE_ID
    RESOURCE_NO
    RESOURCE_NOTE (long text)
TAGS_TABLE
    TAG_ID
    RESOURCE_NO
    TAG_TEXT
Then I would use the note field for textual notes on the file/folder/resource. Choose whether you would use 1:1 or 1:N for this.
The tags field I would use to store any number of searchable parameters like YEAR, PROJECT, and other values that will describe and group your content.
Then you could add tables for owner, stakeholders, and other organisation info etc.

How do you structure config data in a database?

What is people's preferred method of storing application configuration data in a database? Having done this in the past myself, I've used two ways of doing it.
You can create a table where you store key/value pairs, where the key is the name of the config option and the value is its value. The pros of this are that adding new values is easy and you can use the same routines to set/get data. The downside is that the value is untyped data.
Alternatively, you can hardcode a configuration table, with each column being the name of the value and its datatype. The downside to this is more maintenance when setting up new values, but it allows you to have typed data.
Having used both, my preference lies with the first option as it's quicker to set things up; however, it's also riskier and can reduce performance (slightly) when looking up data. Does anyone have any alternative methods?
Update
It's necessary to store the information in a database because as noted below, there may be multiple instances of the program that require configuring the same way, as well as stored procedures potentially using the same values.
You can expand option 1 to have a 3rd column, giving a data-type. Your application can then use this data-type column to cast the value.
But yeah, I would go with option 1, if config files are not an option. Another advantage of option 1 is you can read it into a Dictionary object (or equivalent) for use in your application really easily.
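Option 1 with a data-type column could look roughly like the sketch below, which reads the whole table into a dictionary of typed values. It assumes SQLite via Python's `sqlite3`; the table name `config`, the type labels in `CASTS`, the function `load_config` and the sample settings are all hypothetical.

```python
import sqlite3

# Maps the label stored in the type column to a Python converter.
CASTS = {
    "int": int,
    "float": float,
    "bool": lambda s: s.lower() in ("1", "true", "yes"),
    "str": str,
}


def load_config(conn):
    """Read every row of the config table into a dict of typed values."""
    rows = conn.execute("SELECT name, value, type FROM config")
    return {name: CASTS[type_](value) for name, value, type_ in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE config (name TEXT PRIMARY KEY, value TEXT NOT NULL, type TEXT NOT NULL)"
)
# Made-up settings; every value is stored as text plus a type label.
conn.executemany(
    "INSERT INTO config VALUES (?, ?, ?)",
    [
        ("max_users", "25", "int"),
        ("timeout_seconds", "1.5", "float"),
        ("debug", "false", "bool"),
        ("inputPath", "/data/in", "str"),
    ],
)
config = load_config(conn)
print(config)
```

An unknown type label raises a `KeyError` at load time, which is probably preferable to silently treating the value as a string.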
Since configuration typically can be stored in a text file, the string data type should be more than enough to store the configuration values. If you're using a managed language, it's the code that knows what the data type should be, not the database.
More importantly, consider these things with configuration:
Hierarchy: Obviously, configuration will benefit from a hierarchy
Versioning: Consider the benefit of being able to roll back to the configuration that was in effect at a certain date.
Distribution: Some time, it might be nice to be able to cluster an application. Some properties should probably be local to each node in a cluster.
Documentation: Depending on whether you have a web tool or something, it is probably nice to store the documentation about a property close to the code that uses it. (Code annotations are very nice for this.)
Notification: How is the code going to know that a change has been made somewhere in the configuration repository?
Personally, I like an inverted way of handling configuration, where the configuration properties are injected into the modules, which don't know where the values came from. This way, the configuration management system can be very complex or very simple depending on your (current) needs.
I use option 1.
My project uses a database table with four columns:
ID [pk]
Scope (default 'Application')
Setting
Value
Settings with a Scope of 'Application' are global settings, such as Maximum number of simultaneous users.
Each module has its own scope, so our ResultsLoader and UserLoader have different scopes, but both have a Setting named 'inputPath'.
Defaults are either provided in the source code or are injected via our IoC container. If no value is injected or provided in the database, the default from the code is used (if one exists). Therefore, defaults are never stored in the database.
This works out quite well for us. Each time we backup the database we get a copy of the Configuration which is quite handy. The two are always in sync.
It seems overkill to use the DB for config data.
EDIT (sorry too long for comment box):
Of course there are no strict rules on how you implement any part of your program. For the sake of argument, slotted screwdrivers work on some Phillips screws! I guess I judged too early, before knowing what your scenario is.
A relational database excels as a massive data store that gives you quick storing, updating, and retrieval, so if your config data is updated and read constantly, then by all means use a DB.
Another scenario where a DB may make sense is when you have a server farm and want your database to store your central config, but then you can do the same with a shared network drive that holds the XML config file.
An XML file is better when your config is hierarchically structured. You can easily organize, locate, and update what you need, and as a bonus you can version control the config file along with your source code!
All in all, it all depends on how the config data is used.
That concludes my opinion with limited knowledge of your application. I am sure you can make the right decision.
I guess this is more of a poll, so I'll say the column approach (option 2). However it will depend on how often your config changes, how dynamic it is, and how much data there is, etc.
I'd certainly use this approach for user configurations / preferences, etc.
Go with option 2.
Option 1 is really a way of implementing a database on top of a database, and that is a well-known antipattern, which is just going to give you trouble in the long run.
I can think of at least two more ways:
(a) Create a table with key, string-value, date-value, int-value, real-value columns. Leave unused types NULL.
(b) Use a serialization format like XML, YAML or JSON and store it all in a blob.
Where do you store the configuration settings your app needs to connect to the database?
Why not store the other config info there too?
I'd go with option 1, unless the number of config options were VERY small (seven or less)
At my company, we're working on using option one (a simple dictionary-like table) with a twist. We're allowing for string substitution using tokens which contain the name of the config variable to be substituted.
For example, the table might contain rows ('database connection string', 'jdbc://%host%...') and ('host', 'foobar'). Encapsulating that with a simple service or stored procedure layer allows for an extremely simple, but flexible, recursive configuration. It supports our need to have multiple isolated environments (dev, test, prod, etc).
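The token substitution itself fits in a few lines. What follows is a sketch of the idea rather than the actual service layer described here: it assumes the rows have already been read into a Python dict, that tokens look like `%name%`, and every value in `settings` (including the `.invalid` host name) is made up for the demo.

```python
import re

# A token is a config name wrapped in percent signs, e.g. %host%.
TOKEN = re.compile(r"%(\w+)%")


def resolve(settings, name, _seen=()):
    """Return settings[name] with every %token% replaced, recursively."""
    if name in _seen:
        raise ValueError("circular reference: " + " -> ".join(_seen + (name,)))
    raw = settings[name]
    # Each token is itself resolved, so substitutions can nest.
    return TOKEN.sub(
        lambda m: resolve(settings, m.group(1), _seen + (name,)), raw
    )


# Hypothetical rows, as if loaded from the key/value table.
settings = {
    "database connection string": "jdbc://%host%:%port%/app",
    "host": "%env%-db.example.invalid",
    "port": "5432",
    "env": "dev",
}
print(resolve(settings, "database connection string"))
```

Switching the `env` row between dev, test and prod changes every value that depends on it, and the `_seen` tuple guards against two rows that refer to each other.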
I've used both 1 and 2 in the past, and I think they're both terrible solutions. I think Option 2 is better because it allows typing, but it's a lot more ugly than option 1. The biggest problem I have with either is versioning the config file. You can version SQL reasonably well using standard version control systems, but merging changes is usually problematic. Given an opportunity to do this "right", I'd probably create a bunch of tables, one for each type of configuration parameter (not necessarily for each parameter itself), thus getting the benefit of typing and the benefit of the key/value paradigm where appropriate. You can also implement more advanced structures this way, such as lists and hierarchies, which will then be directly queryable by the app instead of having to load the config and then transform it somehow in memory.
I vote for option 2. Easy to understand and maintain.
Option 1 is good for an easily expandable, central storage location. In addition to some of the great column suggestions by folks like RB, Hugo, and elliott, you might also consider:
Include a Global/User setting flag with a user field or even a user/machine field (for machine-specific UI type settings).
Those can, of course, be stored in a local file, but since you are using the database anyway, that makes these available for aliasing a user when debugging - which can be important if the bug is setting related. It also allows an admin to manage settings when necessary.
I use a mix of option 2 and XML columns in SQL server.
You may also want to add a check constraint to keep the table at one row.
CREATE TABLE [dbo].[MyOption] (
[GUID] uniqueidentifier CONSTRAINT [dfMyOptions_GUID] DEFAULT newsequentialid() ROWGUIDCOL NOT NULL,
[Logo] varbinary(max) NULL,
[X] char(1) CONSTRAINT [dfMyOptions_X] DEFAULT 'X' NOT NULL,
CONSTRAINT [MyOptions_pk] PRIMARY KEY CLUSTERED ([GUID]),
CONSTRAINT [MyOptions_ck] CHECK ([X]='X')
)
for settings that have no relation to any db tables, i'd probably go for the EAV approach if you need the db to work with the values. otherwise a serialized field value is good if it's really just a store for app code.
but what about a format for a single field to store multiple config settings to be used by the db?
like one field per user that contains all their settings related to their messageboard view (like default sort order, blocked topics, etc.), and maybe another with all their settings for their theme (like text color, bg color, etc.)
Storing hierarchies and documents in a relational DB is madness. Either you have to shred them, only to recombine them at some later stage, or they're bunged inside a BLOB, which is even more stupid.
Don't use a relational DB for non-relational data; the tool does not fit. Consider something like MongoDB or CouchDB for this: schema-less, non-relational data stores. Store it as JSON if it's coming down the wire in any way to a client; use XML for server-side.
CouchDB gives you versioning out of the box.
Don't store configuration data in a database unless you have a very good reason to. If you do have a very good reason, and are absolutely certain you are going to do it, you should probably store it in a data serialization format like JSON or YAML (not XML, unless you actually need a markup language to configure your app -- trust me, you don't) as a string. Then you can just read the string, and use tools in whatever language you work in to read and modify it. Store the strings with timestamps, and you have a simple versioning scheme with the ability to store hierarchical data in a very simple system. Even if you don't need hierarchical config data, at least now if you need it in the future you won't have to change your config interface to get it. Of course you lose the ability to do relational queries on your config data, but if you're storing that much config data, then you're probably doing something very wrong anyway.
Companies tend to store lots of configuration data for their systems in a database. I'm not sure why; I don't think much thought goes into these decisions. I don't see this kind of thing done too often in the OSS world. Even large OSS programs that need lots of configuration, like Apache, don't need a connection to a database containing an apache_config table to work. Having a huge amount of configuration to deal with in your apps is a bad code smell; storing that data in a database just causes more problems (as this thread illustrates).
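If the config does end up in a database, the timestamped-string scheme described in this answer might be sketched as follows. It assumes SQLite via Python's `sqlite3`, JSON as the serialization format, and ISO 8601 timestamps in one fixed format so that they sort as text. The table `config_versions`, the functions `save_version` and `load_version`, and the sample values are hypothetical.

```python
import json
import sqlite3


def save_version(conn, config, saved_at):
    """Append a new version; earlier versions are never overwritten."""
    with conn:
        conn.execute(
            "INSERT INTO config_versions (saved_at, body) VALUES (?, ?)",
            (saved_at, json.dumps(config, sort_keys=True)),
        )


def load_version(conn, as_of=None):
    """Return the newest config, or the one in effect at timestamp as_of."""
    if as_of is None:
        row = conn.execute(
            "SELECT body FROM config_versions "
            "ORDER BY saved_at DESC, id DESC LIMIT 1"
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT body FROM config_versions WHERE saved_at <= ? "
            "ORDER BY saved_at DESC, id DESC LIMIT 1",
            (as_of,),
        ).fetchone()
    return None if row is None else json.loads(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE config_versions "
    "(id INTEGER PRIMARY KEY, saved_at TEXT NOT NULL, body TEXT NOT NULL)"
)
# Timestamps are passed in explicitly to keep the demo deterministic.
save_version(conn, {"theme": {"bg": "white"}}, "2024-01-01T00:00:00")
save_version(conn, {"theme": {"bg": "black"}}, "2024-02-01T00:00:00")
print(load_version(conn))
print(load_version(conn, as_of="2024-01-15T00:00:00"))
```

The application only ever sees a parsed dictionary, so hierarchical settings cost nothing extra, and rolling back is a matter of reading an older row.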
