Organizing lots of file uploads

I'm running a website that handles multimedia uploads as one of its primary uses.
I'm wondering what the best practices or industry standards are for organizing a lot of user-uploaded files on a server.

Your question is exceptionally broad, but I'll assume you are talking about storage/organisation/hierarchy of the files (rather than platform/infrastructure).
A typical approach to organisation is to store files in a three-level hierarchical structure derived from the filename itself.
E.g. filename = "My_Video_12.mpg"
which would then be stored in:
/M/y/_/My_Video_12.mpg
Or, another example, "a9usfkj_0001.jpg":
/a/9/u/a9usfkj_0001.jpg
This way, you end up with a manageable structure that makes it easy to locate a file based on its name alone. It also ensures that individual directories do not grow to a huge scale and become incredibly slow to access.
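For illustration, here is a minimal sketch of the scheme in Python (the base directory, the copy semantics and the handling of short names are my own assumptions, not part of any standard):

```python
import os
import shutil

def sharded_path(base_dir, filename, depth=3):
    """Map "My_Video_12.mpg" to base_dir/M/y/_/My_Video_12.mpg."""
    if len(filename) < depth:
        raise ValueError("filename too short to shard")
    # Each of the first `depth` characters becomes one directory level.
    return os.path.join(base_dir, *filename[:depth], filename)

def store_upload(base_dir, src_path, filename):
    """Copy an uploaded file into its sharded location."""
    dest = sharded_path(base_dir, filename)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(src_path, dest)
    return dest

# store_upload("/srv/uploads", "/tmp/upload-1234", "My_Video_12.mpg")
# -> "/srv/uploads/M/y/_/My_Video_12.mpg"
```

A common variant is to shard on a hash of the filename rather than its raw characters, which spreads files more evenly when many names share a prefix.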
Just an idea, but it might be worth being more explicit as to what your question is actually about.

I don't think you are going to get any concrete answers unless you give more context and describe what the use-cases for the files are. Like any other technology decision, the 'best practice' is always a compromise between the different functional and non-functional requirements, and as such the question needs a lot more context to yield answers you can actually act upon.
Having said that, here are some of the strategies I would consider sound options:
1) Use the conventions dictated by the consumer of the files.
For instance, if the files are going to be used by a CMS/publishing solution, that system probably has some standardized solution for handling files.
2) Use a third-party upload solution. There are a bunch of tools that can help guide you to a solution that solves your specific problem. Tools like Transloadit, Zencoder and Encoding all have different options for handling uploads. Having a look at those options should give you an idea of what could be considered "industry standard".
3) Look at proven solutions, and mimic the parts that fit your use-case. There are open-source solutions that handle the sort of things you are describing here. Have a look at the different plugins for, for example, Paperclip, to learn how they organize files, or more importantly, what abstractions they provide that let you change your mind when the requirements change.
4) Design your own solution. Do a spike; it's one of the most efficient ways of exposing requirements you haven't thought about. Try integrating one of the tools mentioned above, and see how it goes. Software is soft, so no decision is final. Maybe the best solution is to just try something and change it when it no longer fits.
This is probably not the concrete answer you were looking for, but as I mentioned at the beginning, design decisions are always a trade-off; "best practice" in one context could be the worst solution in another context :)
Best of luck!

From what I understand, you want a suggestion on how to store the files. If that is what you want, I would suggest having two different storage systems for your files.
The first storage would be a place to hold the physical file, like a directory on your server (with FTP enabled or not, accessible to browsers or not, ...), or Amazon S3 (aws.amazon.com/en/s3/), Rackspace Cloud Files (www.rackspace.com/cloud/cloud_hosting_products/files/) or any other storage solution (you could even choose Dropbox, if you want). All of these options offer APIs to save/retrieve the files.
The second storage would be a database, to index and control the files. In the DB, which could be MySQL, MSSQL or a non-relational database like Amazon DynamoDB or SimpleDB, you store the link to your file (an HTTP link, the path to the file or anything like that).
Also, in the DB you can control and store any metadata of the file you want, and choose one or more of @ebaxt's suggestions to get it. The metadata could be older versions of the file, the words of a text file, the camera model and geo-location of a picture, etc. Of course it depends on your needs and on how the files will really be used. You have a very large number of options, but without more info on what you intend to do it is hard to suggest a solution.
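A hedged sketch of that split, using boto3 for the file store and SQLite for the index (the bucket name, table layout and helper function are made up for illustration):

```python
import sqlite3
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")
db = sqlite3.connect("uploads.db")
db.execute("""CREATE TABLE IF NOT EXISTS files (
                  id INTEGER PRIMARY KEY,
                  filename TEXT,
                  s3_key TEXT,
                  content_type TEXT,
                  uploaded_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def store_file(local_path, filename, content_type, bucket="my-uploads"):
    """Physical bytes go to S3; the DB row indexes and controls the file."""
    key = "uploads/" + filename
    s3.upload_file(local_path, bucket, key)
    db.execute("INSERT INTO files (filename, s3_key, content_type) "
               "VALUES (?, ?, ?)", (filename, key, content_type))
    db.commit()
    return key
```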
In the Amazon articles area (http://aws.amazon.com/articles/Amazon-S3?browse=1) you can find many papers about this, like "Netflix's Transition to High-Availability Storage Systems", "Using the Java Persistence API with Amazon SimpleDB" and "Petboard: An ASP.NET Sample Using Amazon S3 and Amazon SimpleDB".
Regards.

Related

What are the advantages of content repositories (not talking about CMSs)?

Given that a lot of people use content repositories, there must be a good reason. I'm building out a new web application that will need to store content. Can someone help me understand this?
What are the advantages of using a content repository like Apache Jackrabbit, as opposed to writing your own code/API to store images or text pages? Writing your own requires time, but so does implementing and learning a new framework like the content repository API. A benefit of rolling your own seems to me that you know your code and have immediate expertise if you need to enhance or fix it. Using another framework, you need to learn its foibles, and it is always easier to modify code you know than code you don't; i.e., you don't know that underlying framework code as well as your own.
As I said, a lot of people use them, so there must be a reason. I can't see it as being just another "everyone is using them, so we should too". At least I hope it isn't that. :)
A JCR repository allows you to store all your content (from structured database-type data to large multimedia files) in a single place and with a single API, which is extremely convenient and makes your code simpler, avoiding the impedance mismatch between files and data that you usually have in content-based systems.
JCR also provides a lot of infrastructure functionality that you won't have to build or assemble yourself: search (including full-text), observation (callbacks when something changes), versioning, data types including multi-value, ordered nodes, etc.
If you'll allow a shameless plug, my "JCR - best of both worlds" article at http://java.dzone.com/articles/java-content-repository-best describes this in more detail, and also provides a reading list for the JCR spec that should let you get a good overview without reading the whole thing.
The article uses Apache Sling for its examples, which combined with a JCR repository provides a very nice (IMO, but as a Sling committer I'm biased ;-) platform for content-based applications.
My most recent projects have involved both choices: a custom-built data store (MySQL and image files) with a multi-level caching mechanism, and a JCR-based commercial repository.
A few thoughts:
In the short run, a DIY solution offers reduced complexity: you only have to build and learn what you need. And there is at least the opportunity to optimize the data store for your particular application's needs -- more than likely speed of retrieval, but possibly storage footprint, security, or reliability concerns are foremost for you.
However, in the long run, you're looking at a significant increment of work to extend the home-grown system to a new content type (video, e.g.) or to provide new functionality (versioning, maybe).
Also, it's difficult to separate the choice of a data store approach from the choice of tools that content providers will use to populate and maintain the data store. You'll have to give your authors something more than an HTML form with a textarea and a submit button.
This is related to the advantages of standardization: compatibility and interchangeability. If everybody writes his own library and API, there is no compatibility and interchangeability, leading to higher cost.

Distributed database management system - alternatives?

I am working to develop an application that needs data distributed across countries. Content will be supplied "per region", but needs to be easy to copy to another region. On top of this I have general information that needs to be shared and synchronized across the databases.
The organisation I work for is considering implementing this system itself, but it feels like there should be some good solutions out there already (I am open to cloud solutions - the less my company needs to manage, the better).
This might be a vague question, but I think it is possible to answer it well.
What are my options when developing this kind of distributed data system?
Update:
Should have elaborated (but I'm not sure how much I can say given the NDA). Suffice to say, I have "content" which I need stored on some space (files). I need metadata about the content stored and distributed over several nodes (which might be hosted by us or by someone else) to allow fast-paced communication and regionalized differences in the data. I need to control HOW data is replicated between nodes, but preferably in a standards-compliant way (and preferably not written by us).
You can try CouchDB. Its offline replication model sounds like a good fit for a geo-distributed system.
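For what that looks like in practice, here is a minimal sketch of kicking off continuous replication between two CouchDB nodes via the _replicate endpoint (host and database names are made up):

```python
import requests

# Push changes from the EU node to the US node as they happen.
payload = {
    "source": "http://eu-node:5984/content",
    "target": "http://us-node:5984/content",
    "continuous": True,
}
resp = requests.post("http://eu-node:5984/_replicate", json=payload)
print(resp.json())
```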
Interesting question - but it would really help to get more context.
You talk about "data", which usually means something with a fairly well-defined structure, often implemented in a relational database.
You also talk about "content", which usually means something with a (much) less well-defined structure, often implemented as a document of some type. Many solutions exist for structuring "documents", e.g. file systems or web sites.
Assuming we are talking about structured data, the simplest thing to do is to have a single repository, accessible everywhere. Have a look at "cloud" offerings - Amazon's a good bet. Creating your own global data repository is a significant undertaking - but if you're dealing with highly confidential data, or have specific performance requirements, it may be the way to go.
If neither of those options work, you're in the world of "enterprise service bus". Google it, but be careful - it's a complex field, and you really want to find someone who knows what they're doing.
Having said that, using an off the shelf ESB is many times less painful than building your own distributed data structure.
I know it's years after you asked, but I was looking up the answer to the same question and it looks like Cassandra may fit the bill. Once set up, it looks and acts like other database solutions (tables, views, an SQL-like query language, transactions, etc.), but it can also be entirely decentralized. Each instance acts as a node in a cluster of other Cassandra nodes; they synchronize behind the scenes, and if one goes down, the others pick up the slack. This makes Cassandra both highly scalable and highly fault-tolerant.
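As a sketch of the geo-distribution part (the node address, datacenter names and replica counts are made-up examples), a keyspace can be told to keep copies in each region:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Connect to any one node; the driver discovers the rest of the cluster.
session = Cluster(["10.0.0.1"]).connect()

# Keep three replicas of every row in each datacenter.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS content
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'eu_dc': 3,
        'us_dc': 3
    }
""")
```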

NoSQL for filesystem storage organization and replication?

We've been discussing the design of a data warehouse strategy within our group for meeting testing, reproducibility, and data-syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want, you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination - essentially a key-value-pair-style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this); that is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server - we'd like complete replication.
We also have other, smaller files that have a tree-type relationship with the big files. One file's content will point to the next, and so on, and so on. It's not a "spoked wheel"; it's a full-blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table) and you get Mongo's replication features to boot.
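A rough sketch with pymongo (the database name, metadata fields and offsets are invented for illustration):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["warehouse"]
fs = gridfs.GridFS(db)

# Store a large file along with our own look-up metadata.
with open("capture.dat", "rb") as f:
    fs.put(f, filename="capture.dat", source="sensor-7", artifact="run-42")

# Look it up by metadata and read only a slice; GridOut is seekable,
# so only the chunks covering the requested range are fetched.
grid_out = fs.find_one({"source": "sensor-7", "artifact": "run-42"})
grid_out.seek(1024 * 1024)        # skip the first megabyte
chunk = grid_out.read(64 * 1024)  # read 64 KB from that offset
```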
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience, though, Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?

How would you build a database filesystem (DBFS)?

A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on top of a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using FUSE, with a database back-end.
A more difficult approach is to implement it as a kernel module (VFS).
On Windows, you could use IFS.
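To make the FUSE route concrete, here is a minimal read-only sketch using the fusepy bindings with an SQLite back-end (the files table and mount point are my own assumptions):

```python
import errno
import sqlite3
import stat
from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

class DatabaseFS(Operations):
    """Read-only filesystem whose 'files' are rows in an SQLite table."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)

    def _data(self, path):
        row = self.db.execute("SELECT data FROM files WHERE path = ?",
                              (path,)).fetchone()
        if row is None:
            raise FuseOSError(errno.ENOENT)
        return row[0]

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                "st_size": len(self._data(path))}

    def readdir(self, path, fh):
        rows = self.db.execute("SELECT path FROM files").fetchall()
        return [".", ".."] + [r[0].lstrip("/") for r in rows]

    def read(self, path, size, offset, fh):
        return self._data(path)[offset:offset + size]

# Mount it (blocks until unmounted):
# FUSE(DatabaseFS("files.db"), "/mnt/dbfs", foreground=True)
```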
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Probably, using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is a good idea. The FUSE wiki lists a lot of existing projects about database-backed filesystems, as well as filesystems in which you can search via SQL-like queries.
Maybe this is a good starting point for getting an idea of how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an open-source RDBMS, so you can take a really deep look inside, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious: look at mainframes and minis, especially the iSeries OS (now called IBM i; it used to be called i5/OS or OS/400).
Using a relational database as a mass data store is relatively easy - Oracle and MySQL can both do this. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database: a file with some catalogue information (a minimal sketch of this step follows the list).
3) Large data in BLOBs with extensive metadata and complex structures in the database: a file with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> relational map with extensive metadata. While there may be an exportable form, the application naturally works with the database, and the notion of the file as the repository is lost.
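Here is a minimal sketch of step 2 in Python with SQLite (the table layout is invented; a real system would add indexes and stream large files rather than read them whole):

```python
import sqlite3

db = sqlite3.connect("store.db")
db.execute("""CREATE TABLE IF NOT EXISTS blobs (
                  id INTEGER PRIMARY KEY,
                  name TEXT,
                  mime_type TEXT,
                  created_at TEXT DEFAULT CURRENT_TIMESTAMP,
                  data BLOB)""")

def catalogue_file(path, name, mime_type):
    """Step 2: the bytes go into a BLOB; the catalogue info sits beside them."""
    with open(path, "rb") as f:
        db.execute("INSERT INTO blobs (name, mime_type, data) "
                   "VALUES (?, ?, ?)", (name, mime_type, f.read()))
    db.commit()
```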

Storing a file in a database as opposed to the file system?

Generally, how bad of a performance hit is storing a file in a database (specifically mssql) as opposed to the file system? I can't come up with a reason outside of application portability that I would want to store my files as varbinaries in SQL Server.
Have a look at this answer:
Storing Images in DB - Yea or Nay?
Essentially, the space and performance hit can be quite big, depending on the number of users. Also, keep in mind that web servers are cheap and you can easily add more to balance the load, whereas the database is usually the most expensive and hardest-to-scale part of a web architecture.
There are some counter-examples (e.g., Microsoft SharePoint), but usually storing files in the database is not a good idea.
Unless, possibly, you write desktop apps and/or know roughly how many users you will ever have; on something as random and unpredictable as a public web site, you may pay a high price for storing files in the database.
If you can move to SQL Server 2008, you can take advantage of the FILESTREAM support, which gives you the best of both worlds - the files are stored in the filesystem, but the database integration is much better than just storing a file path in a varchar field. Your query can return a standard .NET file stream, which makes the integration a lot simpler.
Getting Started with FILESTREAM Storage
I'd say, it depends on your situation. For example, I work in local government, and we have lots of images like mugshots, etc. We don't have a high number of users, but we need to have good security and auditing around the data. The database is a better solution for us since it makes this easier and we aren't going to run into scaling problems.
What's the question here?
Modern DBMSs (e.g. SQL Server 2008) have a variety of ways of dealing with BLOBs beyond just sticking them in a table. There are pros and cons, of course, and you might need to think about it a little more deeply.
This is an interesting paper, by the late (?) Jim Gray
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
In my own experience, it is always better to store files as files. The reason is that the filesystem is optimised for file storage, whereas a database is not. Of course, there are some exceptions (e.g. the much-heralded next-gen MS filesystem is supposed to be built on top of SQL Server), but in general that's my rule.
While performance is an issue, I think modern database designs have made it much less of an issue for small files.
Performance aside, it also depends on just how tightly-coupled the data is. If the file contains data that is closely related to the fields of the database, then it conceptually belongs close to it and may be stored in a blob. If it contains information which could potentially relate to multiple records or may have some use outside of the context of the database, then it belongs outside. For example, an image on a web page is fetched on a separate request from the page that links to it, so it may belong outside (depending on the specific design and security considerations).
Our compromise, and I don't promise it's the best, has been to store smallish XML files in the database but images and other files outside it.
We made the decision to store as varbinary for http://www.freshlogicstudios.com/Products/Folders/ halfway expecting performance issues. I can say that we've been pleasantly surprised at how well it's worked out.
I agree with @ZombieSheep.
Just one more thing: I generally don't think that databases actually need to be portable, because then you miss all the features your DBMS vendor provides. I think that migrating to another database would be the last thing one would consider. Just my $.02.
The overhead of having to parse a BLOB (image) into a byte array, write it to disk under the proper file name, and then read it back is enough of a performance hit to discourage you from doing this too often, especially if the files are rather large.
Not to be vague or anything, but I think the type of 'file' you will be storing is one of the biggest determining factors. If you are essentially talking about a large text field that could be stored as a file, my preference would be DB storage.
Interesting topic.
There is no absolutely one correct answer to this question.
There are a few key elements to consider:
What’s your database engine?
What’s the route of the file from the database to the end user and/or back?
What are the security requirements?
If the files are meant for a public audience and accessible via a website, you shouldn’t even consider storing the files in the database. Use some smart indexing of the files instead.
If the files contain highly sensitive information, then it might be worth storing them in the database. But you have to implement proper safeguards too.
If performance is crucial, it’s better not to store files in the database.
Backup, restore and migration of the database might become a nightmare if the database grows big just because of the files. If you are the DBA, you will want to kill the person who “invented” the idea of putting files into the database.
I recommend storing files in the database only as a last resort, when there is absolutely no better alternative available.
