How to store files users upload to a website? - database

If a user is going to be uploading files to my website, how should I store them? Should I just put them all on the server and create a folder named after each user's username that contains all their files?
Are there any tutorials you would recommend that explain how to do this?
Thanks!

This is quite a vague question, in my opinion, and highly dependent on your infrastructure, OS, storage type, storage location and the data types you will be storing. It also depends on the amount of data your application will be handling and how much disk I/O you will be doing.
I'm going to continue with the vague assumption that you will be using S3 buckets for this purpose, and would recommend going over this article by Jeff Barr. The article discusses a few performance tips and tricks for Amazon S3 which can be helpful in more general scenarios/environments.
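To make that assumption concrete, here is a minimal sketch of storing uploads under a per-user key prefix, assuming Python with boto3; the bucket name and key layout are invented for illustration:

    import boto3  # assumes AWS credentials are already configured

    s3 = boto3.client("s3")

    def store_upload(username: str, filename: str, data: bytes) -> str:
        """Store an uploaded file under a per-user key prefix; return the key."""
        key = f"uploads/{username}/{filename}"  # e.g. uploads/alice/report.pdf
        s3.put_object(Bucket="example-upload-bucket", Key=key, Body=data)  # hypothetical bucket
        return key

Keying by username like this mirrors the per-user-folder idea from the question, without tying you to any one server's disk.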

Related

Is there any way to download a public DB to hard drive?

I'm a social science researcher, and I'm working with data from various public databases of NGOs, governments, etc. Let's assume that I have no opportunity to ask the admins for the whole database. However, if I have enough patience, I'm able to download all the data one by one. But the size of the DB makes it almost impossible to solve the problem by brute force.
So, is there any way to download a public DB with all of its components?
Here's an example:
http://www.trademap.org/tradestat/Country_SelProductCountry_TS.aspx
You can see the Japanese live animal imports (USD) by importing country. Is there a faster way to download all the data for every country and every product than clicking through them one by one?
Yes, there exist software packages and web services for scraping. You can find them easily with Google - this is a programming site, not a software recommendations site.
Beware that the use of automatic downloading tools may violate the terms of service and get you into legal trouble. Also, websites may block your access if you access them too fast.
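For what it's worth, such a scraper is usually nothing more than a rate-limited download loop. A minimal sketch in Python with the requests library (the query parameter name is hypothetical; check the site's terms of service before running anything like this):

    import time
    import requests  # third-party HTTP library

    BASE_URL = "http://www.trademap.org/tradestat/Country_SelProductCountry_TS.aspx"

    def fetch_all(country_codes):
        """Download one page per country, pausing between requests."""
        pages = {}
        for code in country_codes:
            resp = requests.get(BASE_URL, params={"country": code})  # hypothetical parameter
            resp.raise_for_status()
            pages[code] = resp.text
            time.sleep(2)  # throttle so the server does not block you
        return pages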

What will be the best database for video streaming website?

I want to build a video streaming website, but I want to know: what would be the best DB for a video streaming site? It would be a video streaming site like youtube.com, so what would be the best choice?
Thanks in advance for any advice.
The purpose of a database is to record and relate facts and answer questions. You can certainly capture information about videos in a database: file name and location, title, size, width and height, description, content tags, uploader, access permissions, and so on. A DBMS is an excellent tool for managing all the knowledge you need to make the site work and be useful.
The videos themselves are best served from a file system rather than from a database - most DBMSs are optimized for large sets of small values and don't have dedicated data types or operators for video, let alone support for modern codecs and advanced video manipulation. In contrast, a LOT of software is available for processing videos stored as files.
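As a minimal sketch of that split (Python with the built-in sqlite3 module; the table layout and sample row are invented), the database records the facts about each video while file_path simply points at the file system:

    import sqlite3

    conn = sqlite3.connect("videos.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            id          INTEGER PRIMARY KEY,
            file_path   TEXT NOT NULL,   -- where the video lives on disk
            title       TEXT NOT NULL,
            width       INTEGER,
            height      INTEGER,
            size_bytes  INTEGER,
            uploader    TEXT,
            description TEXT
        )""")
    conn.execute(
        "INSERT INTO videos (file_path, title, width, height, size_bytes, uploader) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ("/var/media/cats_001.mp4", "Cats", 1920, 1080, 734003200, "alice"))
    conn.commit()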

serve huge static files with horizontal scale

I hope I can find a distributed filesystem which is easy to configure, easy to use and easy to learn.
Can anyone help with this?
As the usage details are not mentioned, as far as I can infer from the question you should try MogileFS (it is easy to set up and maintain). It is from the makers of memcached and is used to serve images, etc.
Please refer to the link below for a better explanation.
http://code.google.com/p/mogilefs/
See also: Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
I suggest you consider using Apache Hadoop. It has a lot of services and technologies to work with (Cassandra, HBase, etc.). A quote from the official site:
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Basically, Hadoop is a large framework. You can use Karmasphere Studio with Hadoop. I suppose that with its help you can learn Hadoop much quicker and get deeper into distributed systems.
About HDFS: read the article "GridGain and Hadoop". A short quote from there:
today HDFS is probably the most economical way to keep very large static data set of TB and PB scale in distributed file system for a long term storage
Check out Amazon Simple Storage Service (Amazon S3).
It has (practically) unlimited storage and extremely high availability, and it ticks most of the boxes needed for most situations. It isn't free, but it is very cheap considering what you get.

Organizing lots of file uploads

I'm running a website that handles multimedia uploads as one of its primary uses.
I'm wondering what the best practices or industry standards are for organizing a lot of user-uploaded files on a server.
Your question is exceptionally broad, but I'll assume you are talking about storage/organisation/hierarchy of the files (rather than platform/infrastructure).
A typical approach to organisation is to store files in a three-level directory hierarchy derived from the filename itself.
E.g. filename = "My_Video_12.mpg"
which would then be stored in
/M/y/_/My_Video_12.mpg
Or another example, "a9usfkj_0001.jpg":
/a/9/u/a9usfkj_0001.jpg
This way, you end up with a manageable structure that makes it easy to locate a file simply from its name. It also ensures that individual directories do not grow to a huge size and become incredibly slow to access.
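As a sketch, the filename-to-path mapping is only a few lines (Python; the root directory is arbitrary):

    import os

    def storage_path(root: str, filename: str) -> str:
        """Derive a three-level directory path from the first characters of the filename."""
        a, b, c = filename[0], filename[1], filename[2]
        return os.path.join(root, a, b, c, filename)

    print(storage_path("/uploads", "My_Video_12.mpg"))   # /uploads/M/y/_/My_Video_12.mpg
    print(storage_path("/uploads", "a9usfkj_0001.jpg"))  # /uploads/a/9/u/a9usfkj_0001.jpg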
Just an idea, but it might be worth being more explicit as to what your question is actually about.
I don't think you are going to get any concrete answers unless you give more context and describe what the use-cases for the files are. Like any other technology decision, the 'best practice' is always going to be a compromise between the different functional and non-functional requirements, and as such the question needs a lot more context to yield answers that you can go and act upon.
Having said that, here are some of the strategies I would consider sound options:
1) Use the conventions dictated by the consumer of the files.
For instance, if the files are going to be used by a CMS/publishing solution, that system probably has some standardized solution for handling files.
2) Use a third-party upload solution. There are a bunch of tools that can help guide you to a solution that solves your specific problem. Tools like Transloadit, Zencoder and Encoding all have different options for handling uploads. Having a look at those options should give you an idea of what could be considered "industry standard".
3) Look at proven solutions, and mimic the parts that fit your use-case. There are open-source solutions that handle the sort of things you are describing here. Have a look at the different plugins for, for example, Paperclip, to learn how they organize files, or more importantly, what abstractions they provide that let you change your mind when the requirements change (a sketch of such an abstraction follows this list).
4) Design your own solution. Do a spike; it's one of the most efficient ways of exposing requirements you haven't thought about. Try integrating one of the tools mentioned above and see how it goes. Software is soft, so no decision is final. Maybe the best solution is to just try something and change it when it doesn't fit anymore.
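To illustrate point 3, here is a rough sketch (Python; all names are invented) of the kind of abstraction those libraries give you - application code talks only to FileStore, so a local backend can later be swapped for an S3-backed one without touching the callers:

    import os
    from abc import ABC, abstractmethod

    class FileStore(ABC):
        """Minimal storage abstraction so the backend can change later."""

        @abstractmethod
        def save(self, name: str, data: bytes) -> str: ...

        @abstractmethod
        def load(self, name: str) -> bytes: ...

    class LocalFileStore(FileStore):
        """Backend that keeps files in a plain directory."""

        def __init__(self, root: str):
            self.root = root

        def save(self, name: str, data: bytes) -> str:
            path = os.path.join(self.root, name)
            with open(path, "wb") as f:
                f.write(data)
            return path

        def load(self, name: str) -> bytes:
            with open(os.path.join(self.root, name), "rb") as f:
                return f.read()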
This is probably not the concrete answer you were looking for, but like I mentioned in the beginning, design decisions are always a trade-off, "best-practice" in one context could be the worst solution in another context :)
Best of luck!
From what I understand, you want a suggestion on how to store the files. If that is what you want, I would suggest you have two different storage systems for your files.
The first storage would be a place to store the physical file, like a directory on your server (without FTP enabled, accessible to browsers or not, ...), or Amazon S3 (aws.amazon.com/en/s3/), Rackspace Cloud Files (www.rackspace.com/cloud/cloud_hosting_products/files/) or any other storage solution (you could even choose Dropbox, if you want). All of these options offer APIs to save/retrieve the files.
The second storage would be a database, to index and control the files. The DB could be MySQL, MSSQL or a non-relational database like Amazon DynamoDB or SimpleDB; in it you store the link to your file (an HTTP link, the path to the file or anything like that).
Also, in the DB you can control and store any metadata of the file you want, and choose one or many of #ebaxt's solutions to get it. The metadata could be older versions of the file, the words of a text file, the camera model and geo-location of a picture, etc. Of course it depends on your needs and how it will really be used. You have a very large number of options, but without more info on what you intend to do it is hard to suggest a solution.
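A rough sketch of that second storage (Python with sqlite3; the table, columns and example URL are all invented) - once the physical file has been pushed to S3/Cloud Files/a plain directory, the database row just points at it and carries the metadata:

    import sqlite3

    db = sqlite3.connect("file_index.db")
    db.execute("CREATE TABLE IF NOT EXISTS files (url TEXT, uploader TEXT, camera_model TEXT)")

    def index_upload(storage_url: str, metadata: dict) -> None:
        """Record where the physical file lives plus whatever metadata we care about."""
        db.execute(
            "INSERT INTO files (url, uploader, camera_model) VALUES (?, ?, ?)",
            (storage_url, metadata.get("uploader"), metadata.get("camera_model")))
        db.commit()

    # After the bytes have been pushed to the first storage:
    index_upload("https://example-bucket.s3.amazonaws.com/a9usfkj_0001.jpg",
                 {"uploader": "alice", "camera_model": "Canon EOS 5D"})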
In the Amazon tutorials area (http://aws.amazon.com/articles/Amazon-S3?browse=1) you can find many papers about this, like Netflix's Transition to High-Availability Storage Systems, Using the Java Persistence API with Amazon SimpleDB, and Petboard: An ASP.NET Sample Using Amazon S3 and Amazon SimpleDB.
Regards.

Common Interface for CouchDB and Amazon S3

I just read tons of material on Amazon's S3 and CouchDB. Maybe not enough yet though, so here is my question:
Both systems sound very appealing to me. CouchDB is distributed under the Apache License v2, while with Amazon's S3 you pay per stored megabyte and for the traffic you generate. So there is a bit of a difference monetarily.
But from a technical point of view, from what I understand, both systems help you store unstructured data of arbitrary size (depending on the underlying OS, as I understand it, in CouchDB's case).
I don't know how easy it would be to come up with a unified interface for both of them, so that you could just change your "datastore provider" as the need arises, without having to change any of your code.
I also don't know if this is technically feasible; I haven't looked at their protocols in great detail yet. But it would be great to be able to postpone the provider decision to as late as possible.
Also this could be interesting for integration testing purposes: You could for example test against a local CouchDB instance and run your code against S3 for production use.
To formulate my question from a different angle: are Amazon's S3 and CouchDB essentially solving the exact same problem, or is this insane and I have missed the whole point?
Updated Question
After Jim's brilliant answer, let me rephrase the question to:
"Common Interface for CouchDB and Amazon SimpleDB"
And following the same line of thinking, do you see a problem with a common interface between CouchDB and SimpleDB?
You're missing the point, just slightly. CouchDB is a database. S3 is a filesystem. They're both relatively unstructured, but with S3 you're storing files under keys while with CouchDB you're storing (arbitrarily-structured) data under keys.
The Amazon Web Services analogue to something like CouchDB would be Amazon SimpleDB.
Something like what you're looking for already exists for Ruby; it's called Moneta. It can even store stuff on S3, which may be exactly what you want.
You are wrong, Jim. S3 is not a filesystem. It is a web service in front of a key-value store.
Amazon provides you with a key. Yes, the value of that key can be data that represents a file. But how that gets managed inside the Amazon system is something entirely different. It can be stored on one node, on multiple nodes, or on geographically strategic nodes with CloudFront, and so on. There is nothing in the key in and of itself that indicates how the system will manage the file. The value of the key is never a file directly; it is data that represents the file. How that value eventually gets resolved into a file that the client receives is entirely separate.
The value of that key can also be data that does not represent a file at all. It can be a JSON dictionary. In that sense, S3 can be used in the same way as CouchDB.
So I don't think the question misses the point. In fact, it is a perfectly legitimate question, as data in CouchDB is not distributed amongst nodes, and that could hamper performance.
Let's not even talk about Amazon SimpleDB. That is something separate. Please don't mix terms and then make claims based on that.
If you are not convinced by this claim, and if people request it, I am happy to provide a code bit that illustrates a JSON dictionary in S3.
I respect your answers to other questions, Jim. But here you are clearly wrong, and I cannot see how those points are justified.
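For illustration, here is what such a code bit could look like today (a sketch assuming Python and boto3; the bucket and key are made up) - the value stored under the key is a JSON dictionary, not a file:

    import json
    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")

    # Store an arbitrary JSON dictionary under a key, CouchDB-style:
    doc = {"type": "user", "name": "alice", "tags": ["admin", "beta"]}
    s3.put_object(Bucket="example-bucket", Key="docs/alice",
                  Body=json.dumps(doc).encode("utf-8"))

    # ...and read it back:
    obj = s3.get_object(Bucket="example-bucket", Key="docs/alice")
    restored = json.loads(obj["Body"].read())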
Technically, a common layer is possible. However, I question whether it would make sense.
CouchDB has integrated map/reduce functions for your documents, which are exposed as "views". I don't think SimpleDB has anything like that. On the other hand, SimpleDB has query expressions, which CouchDB has not. Of course, you can model those expressions as a view in CouchDB if you know your query at development time.
Besides that, the common functionality is not much more than creating/updating/deleting a key-document pair.
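To make that concrete, the common layer would amount to little more than this (a Python sketch; the interface and class names are invented). The in-memory backend stands in for tests; a CouchDB or SimpleDB adapter would implement the same three methods on top of the respective client library:

    from abc import ABC, abstractmethod

    class DocumentStore(ABC):
        """Lowest common denominator: create/update/delete a key-document pair."""

        @abstractmethod
        def put(self, key: str, doc: dict) -> None: ...

        @abstractmethod
        def get(self, key: str) -> dict: ...

        @abstractmethod
        def delete(self, key: str) -> None: ...

    class InMemoryStore(DocumentStore):
        def __init__(self):
            self._docs = {}

        def put(self, key, doc):
            self._docs[key] = doc

        def get(self, key):
            return self._docs[key]

        def delete(self, key):
            del self._docs[key]

Views and query expressions would then have to live outside this common interface, which is exactly why the layer buys you less than it first appears.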
This is an old question, but it comes up first in my searches for this.
PouchDB is a fully compliant CouchDB interface and uses LevelDown for its backend. It is capable of using any LevelDown-compliant interface.
S3LevelDown is a LevelDown-compliant interface that uses S3 as its backing store.
https://github.com/loune/s3leveldown
In theory, you could put them together to create a full CouchDB-style implementation backed by S3.
https://github.com/loune/s3leveldown/tree/master/examples/pouchdb
