cross-platform frameworks for storage + metadata? - database

I don't quite know what to use for terminology, so bear with me...
Are there any cross-platform frameworks out there that facilitate a kind of "virtual file storage" to encapsulate adding files along with a database of metadata? I'm thinking about something along the lines of iTunes or iPhoto, where the program manages a whole bunch of files (in those cases audio or image files) and has a database of metadata so you can organize/find those files easily. I'd like to cobble together something along those lines for files in general.
edit: I am hesitant to store files in a database alone, e.g. MySQL, as there would be potentially tens of gigabytes in my application (this issue has been mentioned in several SO posts, see this one that gives several links to others). I'm looking at CouchDB though and maybe it has promise....

Why not a database? You can keep files inside a database.

Related

What do programs use to store data? (do they use a known database system?)

I was thinking - let's take a look at a computer game of any kind, or any program in general.
(Chrome, Skype, Warcraft,...)
They need to save some things that a user wanted them to save.
How do they do it?
Do they save it in a simple text file, or do they pack a database system (like MySQL,...) with themselves?
That really depends on your needs. If you only need to store some key-value pairs, an application can use a simple text file (e.g. an *.ini file). That, however, is a plain text file readable by everybody.
An application can of course also use a database like MySQL or MS SQL. However, these are not very handy if you want to distribute your application, as they run as a separate service on a server and need to be installed separately. Then there are databases like SQLite, which is also an SQL database but stores everything inside a single file. Your application just needs a way to interact with this file.
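As a minimal sketch of the single-file approach, the following uses Python's built-in sqlite3 module. The file name, table name, and the "volume" setting are made up for illustration; a real application would pick its own schema and a permanent location.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name; the whole database lives in this one file.
db_path = os.path.join(tempfile.mkdtemp(), "app_data.sqlite")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT OR REPLACE INTO settings VALUES (?, ?)", ("volume", "80"))
conn.commit()
conn.close()

# Reopen the file later (e.g. on the next program start) and read the value back.
conn = sqlite3.connect(db_path)
row = conn.execute("SELECT value FROM settings WHERE key = ?", ("volume",)).fetchone()
conn.close()
stored_volume = row[0]
```

No server process is involved: the application only needs read/write access to that one file.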
Yet another way would be to serialize/deserialize an object that holds the data you want to store.
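A rough sketch of the serialization approach, assuming JSON as the on-disk format (other formats would work just as well). The SaveGame class and its fields are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SaveGame:
    # Hypothetical fields, chosen only for illustration.
    player: str
    level: int
    inventory: list


def serialize(state: SaveGame) -> str:
    """Turn the object into a JSON string that can be written to a file."""
    return json.dumps(asdict(state))


def deserialize(text: str) -> SaveGame:
    """Rebuild the object from the JSON string."""
    return SaveGame(**json.loads(text))


original = SaveGame(player="alice", level=3, inventory=["sword", "potion"])
restored = deserialize(serialize(original))
```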
There are other ways to store data, like NoSQL databases. I personally haven't used one of those yet, but here is a listing of some of them: http://nosql-database.org/
XML could also be used.
There are endless ways an application can store its data.
There is literally no end to the ways programs will store data. Off the top of my head:
home-made archive formats: every game company seems to have a few of their own (Blizzard MoPaQ, ...)
XML files: usually used for simple configuration (Apple's plist files, Windows application configurations, Skype's user preferences, ...)
SQLite databases: usually used for larger amounts of personal data (Firefox: bookmarks, history, etc.; iOS personal information databases, etc.)
"In the cloud" in someone else's database (basically all web apps)
Plain text or simple text formats (Windows .ini/.inf, Java MANIFEST.MF, YAML, etc.)
...
A single program might use multiple methods depending on what it's storing. There is no unified solution, and no one solution is right for every task, since every system has tradeoffs (human-readability vs. packing efficiency, random access vs. sequential archive, etc.).
A lot of programs use SQLite to store data (http://www.sqlite.org). SQLite is a very compact, cross-platform SQL database. Many programs also use plain text files.

Organizing lots of file uploads

I'm running a website that handles multimedia uploads for one of its primary uses.
I'm wondering what the best practices or industry standards are for organizing a lot of user-uploaded files on a server.
Your question is exceptionally broad, but I'll assume you are talking about storage/organisation/hierarchy of the files (rather than platform/infrastructure).
A typical approach for organisation is to upload files to a 3 level hierarchical structure based on the filename itself.
E.g. filename = "My_Video_12.mpg"
which would then be stored in:
/M/y/_/My_Video_12.mpg
Or another example, "a9usfkj_0001.jpg"
/a/9/u/a9usfkj_0001.jpg
This way, you end up with a manageable structure that makes it easy to locate a file's location simply based on its name. It also ensures that directories do not grow to a huge scale and become incredibly slow to access.
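A minimal sketch of that mapping, assuming each of the first three characters of the filename becomes one directory level. The function name sharded_path is made up, and a real system would also need to sanitize names (for example names starting with "." or containing "/").

```python
from pathlib import PurePosixPath


def sharded_path(filename: str, levels: int = 3) -> PurePosixPath:
    """Return /<c1>/<c2>/<c3>/<filename> using the first `levels` characters."""
    if len(filename) < levels:
        raise ValueError("filename is shorter than the number of directory levels")
    parts = list(filename[:levels])
    return PurePosixPath("/", *parts, filename)


path = sharded_path("a9usfkj_0001.jpg")
print(path)  # /a/9/u/a9usfkj_0001.jpg
```

Because the path is derived purely from the name, no lookup table is needed to find a file again.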
Just an idea, but it might be worth being more explicit as to what your question is actually about.
I don't think you are going to get any concrete answers unless you give more context and describe what the use cases are for the files. Like any other technology decision, the 'best practice' is always going to be a compromise between the different functional and non-functional requirements, and as such the question needs a lot more context to yield answers that you can go and act upon.
Having said that, here are some of the strategies I would consider sound options:
1) Use the conventions dictated by the consumer of the files.
For instance, if the files are going to be used by a CMS/publishing solution, that system probably has some standardized solution for handling files.
2) Use a third-party upload solution. There are a bunch of tools that can help guide you to a solution that solves your specific problem. Tools like Transloadit, Zencoder and Encoding all have different options for handling uploads. Having a look at those options should give you an idea of what could be considered "industry standard".
3) Look at proven solutions, and mimic the parts that fit your use case. There are open-source solutions that handle the sort of things you are describing here. Have a look at the different plugins for Paperclip, for example, to learn how they organize files or, more importantly, what abstractions they provide that let you change your mind when the requirements change.
4) Design your own solution. Do a spike, it's one of the most efficient ways of exposing requirements you haven't thought about. Try integrating one of the tools mentioned above, and see how it goes. Software is soft, so no decision is final. Maybe the best solution is to just try something, and change it when it doesn't fit anymore.
This is probably not the concrete answer you were looking for, but like I mentioned in the beginning, design decisions are always a trade-off; "best practice" in one context could be the worst solution in another context :)
Best of luck!
From what I understand, you want a suggestion on how to store the files. If that is what you want, I would suggest having 2 different storage systems for your files.
The first storage would be a place to store the physical file, like a directory on your server (w/o FTP enabled, accessible or not to browsers, ...), or go for Amazon S3 (aws.amazon.com/en/s3/), Rackspace CloudFiles (www.rackspace.com/cloud/cloud_hosting_products/files/) or any other storage solution (you can even choose Dropbox, if you want). All of these options offer APIs to save/retrieve the files.
The second storage would be a database, to index and control the files. In the DB, which could be MySQL, MSSQL or a non-relational database like Amazon DynamoDB or SimpleDB, you store the link to your file (an HTTP link, the path to the file or anything like this).
Also, in the DB you can control and store any metadata of the file you want, and choose one or more of @ebaxt's solutions to get it. The metadata can be older versions of the file, the words of a text file, the camera model and geolocation of a picture, etc. Of course it depends on your needs and how it will really be used. You have a very large number of options, but without more info about what you intend to do, it is hard to suggest a solution.
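The two-storage idea could be sketched roughly as follows, assuming a local directory as the file store and SQLite as the index. The table layout, the add_file helper, and the camera_model column are hypothetical; with S3 or a similar service, the stored path would be a URL or object key instead.

```python
import os
import shutil
import sqlite3
import tempfile
import uuid


def add_file(conn, store_dir, source_path, camera_model=None):
    """Copy a file into the store and record its location and metadata."""
    stored_path = os.path.join(store_dir, uuid.uuid4().hex)
    shutil.copyfile(source_path, stored_path)
    conn.execute(
        "INSERT INTO files (path, original_name, camera_model) VALUES (?, ?, ?)",
        (stored_path, os.path.basename(source_path), camera_model),
    )
    conn.commit()
    return stored_path


# First storage: a directory holding the physical files.
work_dir = tempfile.mkdtemp()
store_dir = os.path.join(work_dir, "store")
os.makedirs(store_dir)

# Second storage: a database that indexes the files and their metadata.
conn = sqlite3.connect(os.path.join(work_dir, "index.sqlite"))
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT, "
    "original_name TEXT, camera_model TEXT)"
)

# A stand-in upload; the content is placeholder bytes, not a real image.
sample = os.path.join(work_dir, "holiday.jpg")
with open(sample, "wb") as f:
    f.write(b"placeholder image bytes")

stored = add_file(conn, store_dir, sample, camera_model="ExampleCam 1")
matches = conn.execute(
    "SELECT path FROM files WHERE camera_model = ?", ("ExampleCam 1",)
).fetchall()
```

The database never holds the file contents, only where to find them, so the file store can be swapped out without changing the queries.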
In the Amazon tutorials area (http://aws.amazon.com/articles/Amazon-S3?browse=1) you can find many papers about it, like "Netflix's Transition to High-Availability Storage Systems", "Using the Java Persistence API with Amazon SimpleDB" and "Petboard: An ASP.NET Sample Using Amazon S3 and Amazon SimpleDB".
Regards.

Finding databases for use in applications

Does anyone have some recommendations on how I can find databases for random things that I might want to use in my application. For example, a database of zip code locations, area code cities, car engines, IP address locations, food calorie counts, book list, or whatever. I'm just asking generally when you decide you need a bunch of data where are some good places to start looking other than google?
EDIT:
Sorry if I was unclear. It doesn't necessarily need to be a database. Just something I could dump into a relational database like a CSV file. I was just listing some examples of things I've needed in the past. I always find myself searching all over Google for these types of things and am trying to find a few places to look first. Are there companies that just collect tons of data on things and sell it or give it away?
Bureau of Labor Statistics
The Zip Code Database Project
US Census Bureau (You can use their interface to create and download custom CSV files)
Data.gov has tons of stuff
You don't necessarily need a database to get going. A lot of the kinds of information you're listing will exist in delimited files (such as CSVs). Any structured text file will do, since importing with most major database engines is fairly trivial. In fact, the raw data will most likely not exist in a db for that reason: you can import it into the RDBMS of your choice, and the provider of the data does not need to worry about a bunch of different db formats.
Zip Code CSV file at SourceForge
List of English words (not CSV per se, but just as easy to import... for things like spell checkers)
Really good looking source of a few different datasets
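To illustrate how little work the import of a delimited file takes, here is a minimal sketch using Python's csv and sqlite3 modules. The column names and the two sample rows are invented placeholders, not real zip code data; with a downloaded file you would open it instead of using the inline string.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for a downloaded CSV file.
csv_text = "zip,city,state\n00001,Exampleville,EX\n00002,Sampleton,SM\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zip_codes (zip TEXT PRIMARY KEY, city TEXT, state TEXT)")

# DictReader yields one mapping per row, keyed by the header line,
# which matches the named placeholders in the INSERT statement.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO zip_codes (zip, city, state) VALUES (:zip, :city, :state)",
    reader,
)
conn.commit()

city = conn.execute(
    "SELECT city FROM zip_codes WHERE zip = ?", ("00002",)
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM zip_codes").fetchone()[0]
```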
If you want to track the user's position, http://code.google.com/apis/gears/api_geolocation.html might be a better way than a lookup table or a CSV file.

NoSQL for filesystem storage organization and replication?

We've been discussing the design of a data warehouse strategy within our group to meet testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree type relationship with the Big files. One file's content will point to the next and so on, and so on. It's not a "spoked wheel," it's a full blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken out into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table) and you get Mongo's replication features to boot.
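The chunking idea can be sketched without MongoDB at all. The following is not the GridFS API; it is a hypothetical illustration in SQLite of how splitting a file into chunks lets you query by metadata and read only a portion. The table names, the put and read_range helpers, and the tiny chunk size are all assumptions made for the example.

```python
import sqlite3

CHUNK_SIZE = 4  # tiny on purpose, so the example is easy to follow

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, filename TEXT, source TEXT)")
conn.execute(
    "CREATE TABLE chunks (file_id INTEGER, n INTEGER, data BLOB, "
    "PRIMARY KEY (file_id, n))"
)


def put(filename, source, data):
    """Store metadata in one table and the content as numbered chunks in another."""
    cur = conn.execute(
        "INSERT INTO files (filename, source) VALUES (?, ?)", (filename, source)
    )
    file_id = cur.lastrowid
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        conn.execute(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            (file_id, n, data[start:start + CHUNK_SIZE]),
        )
    return file_id


def read_range(file_id, offset, length):
    """Read `length` bytes starting at `offset`, touching only the chunks needed."""
    if length <= 0:
        return b""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    rows = conn.execute(
        "SELECT data FROM chunks WHERE file_id = ? AND n BETWEEN ? AND ? ORDER BY n",
        (file_id, first, last),
    ).fetchall()
    joined = b"".join(row[0] for row in rows)
    start = offset - first * CHUNK_SIZE
    return joined[start:start + length]


file_id = put("run_01.dat", "sensor-a", b"0123456789abcdef")
found_id = conn.execute(
    "SELECT id FROM files WHERE source = ?", ("sensor-a",)
).fetchone()[0]
middle = read_range(found_id, 5, 6)  # b"56789a"
```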
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?

How would you build a database filesystem (DBFS)?

A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using FUSE, with a database back-end.
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems as well as filesystems in which you can search by SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can also take a really deep look at it if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious. Look at mainframes and minis, especially the iSeries OS (now called IBM i; it used to be called i5/OS or OS/400).
Using a relational database as a mass data store is relatively easy. Oracle and MySQL both have these. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database. File with some catalogue information.
3) Large data in BLOBs with extensive metadata and complex structures in the database. File with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> Relational map with extensive meta-data. While there may be an exportable form, the application naturally works with the database, the notion of the file as the repository is lost.
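As a rough illustration of step 2 above (data in a BLOB with light catalogue metadata), here is a sketch using SQLite. The table and column names are hypothetical, and a production system would need to think about BLOB size limits and streaming, which this ignores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,       -- light catalogue metadata
        mime_type TEXT,
        content BLOB NOT NULL     -- the file itself
    )
    """
)

payload = b"hello, blob"
conn.execute(
    "INSERT INTO documents (name, mime_type, content) VALUES (?, ?, ?)",
    ("greeting.txt", "text/plain", payload),
)

# Find files through their metadata without reading the content...
name, size = conn.execute(
    "SELECT name, length(content) FROM documents WHERE mime_type = ?",
    ("text/plain",),
).fetchone()

# ...and fetch the content only when it is actually needed.
content = conn.execute(
    "SELECT content FROM documents WHERE name = ?", ("greeting.txt",)
).fetchone()[0]
```

Steps 3 and 4 would add more tables describing the internal structure of the data, until the BLOB itself is no longer the primary representation.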

Resources