What is the best database for a video streaming website? - database

I want to build a video streaming website like youtube.com. What would be the best database for a site like that? I need some help deciding.
Thanks in advance for your advice.

The purpose of a database is to record and relate facts and answer questions. You can certainly capture information about videos in a database, like file name and location, title, size, width and height, description, content tags, uploader, access permissions, and so on. A DBMS is an excellent tool for managing all the knowledge you need to make the site work and be useful.
The videos themselves are best served from a file system rather than from a database - most DBMSs are optimized for large sets of small values and don't have dedicated data types or operators for videos, let alone support for modern codecs and advanced video manipulation. In contrast, a LOT of software is available for processing videos stored as files.
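To make the split concrete, here is a minimal sketch of a metadata table that only points at video files on disk. It uses SQLite purely for illustration; the table name, column names and the example path are assumptions, not something the answer prescribes.

```python
import sqlite3

# Hypothetical schema: the database holds facts about each video,
# while the video bytes stay on the file system.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE videos (
        id          INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        file_path   TEXT NOT NULL,   -- location of the file on disk
        width       INTEGER,
        height      INTEGER,
        uploader    TEXT,
        description TEXT
    )
    """
)


def register_video(conn, title, file_path, width, height, uploader, description=""):
    """Record metadata for a video that already exists as a file."""
    cur = conn.execute(
        "INSERT INTO videos (title, file_path, width, height, uploader, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (title, file_path, width, height, uploader, description),
    )
    conn.commit()
    return cur.lastrowid


def find_by_uploader(conn, uploader):
    """Answer a question about the catalogue: which files did this user upload?"""
    rows = conn.execute(
        "SELECT title, file_path FROM videos WHERE uploader = ? ORDER BY id",
        (uploader,),
    )
    return rows.fetchall()


# Example path and uploader are made up for the illustration.
video_id = register_video(conn, "Cat video", "/srv/videos/cat.mp4", 1280, 720, "alice")
```

The web server or a dedicated file server would then serve the file at `file_path` directly, and the database is only consulted for lookups.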

Related

How to store files users upload on website?

If a user is going to be uploading files to my website, how should I store them? Should I just chuck it all on the server and create a folder named after the user's username that contains all their files?
Are there any tutorials you would recommend that explain how to do this?
Thanks!
This is quite a vague question in my opinion and highly dependent on your infrastructure, OS, storage type and storage location and the data types you will be storing. It also depends on the amount of data that your application will be handling and the amount of I/Os to disk that you will be doing.
I'm going to continue with a vague assumption that you will be using S3 buckets for this, and would recommend going over this article by Jeff Barr. The article discusses a few performance tips and tricks for using Amazon S3 which can be helpful in more general scenarios/environments.

Web data extraction and data mining; Scraping vs Injection and how to get data.. like yesterday

I feel like I should almost give a friggin synopsis to this/these lengthy question(s)..
I apologize if all of these questions have been answered specifically in a previous question/answer post, but I have been unable to locate any that specifically addresses all of the following queries.
This question involves data extraction from the web (i.e. web scraping, data mining etc). I have spent almost a year doing research into these fields and how they can be applied to a certain industry. I have also familiarized myself with PHP and MySQL/phpMyAdmin.
In a nutshell, I am looking for a way to extract information from a site (probably several gigs' worth) as quickly and efficiently as possible. I have tried web scraping programs like Scrapy and WebHarvey, and I have also experimented with programs like HTTrack. All have their strengths and weaknesses. I have found that WebHarvey works pretty well, yet it has its limitations when scraping images that are stored in gallery widgets. I also find that many of the sites I am extracting from use other methods to make mining data a pain. It would take months to extract the data using WebHarvey, which I can't complain about given that I'd be extracting millions of rows' worth of data exported in CSV format into Excel. But again, images and certain AJAX widgets throw the program off when it tries to extract image files.
So my questions are as follows:
Are there any quicker ways to extract said data?
Is there any way to get around the webharvey image limitations (ie only being able to extract one image within a gallery widget / not being able to follow sub-page links on sites that embed their crap funny and try to get cute with coding)?
Are there any ways to bypass site search form parameters that limit the number of search results (i.e. obtaining all business listings within an entire state instead of being limited to a county per search form restrictions)?
Also, this is public information so therefore it cannot be copyrighted; anybody can take it :) (case in point: Feist Publications v. Rural Telephone Service). Extracting information is extracting information. It's legal to extract as long as we are talking facts/public information.
So with that said, wouldn't the most efficient method (grey area here) of extracting this "public" information (assuming vulnerabilities existed), be through the use of sql injection?... If one was so inclined? :)
As a side question, just how effective is Tor at obscuring one's IP address? Lol
Any help, feedback, suggestions or criticism would be greatly appreciated. I am by no means an expert in any of the above mentioned fields. I am just a motivated individual with a growing interest in programming and automation who has a lot of crazy ideas. Thank you.
You may be better off writing your own Linux command-line scraping program using either a headless browser library like PhantomJS (JavaScript), or a test framework like Selenium WebDriver (Java).
Once you have your scraping program completed, you can then scale it up by installing it on a cloud server (e.g. Amazon EC2, Linode, Google Compute Engine or Microsoft Azure) and duplicating the server image to as many instances as are required.

Which database does Youtube use at the moment?

I hope anyone can help me out in this topic, even if it's not a specific programming question.
I'm writing a bachelor thesis, where I compare MySQL to MongoDB and I want to write something about Youtube, as the platform has to handle many requests with heavy dataload.
The only good resource which I found was this video: Seattle Conference on Scalability: YouTube Scalability
As the conference was in 2007, I can imagine there have been some updates regarding the database.
The last information that I have from this talk is that the thumbnails are stored in a BigTable database and the metadata in MySQL. Are there any changes since then?
Where are the videos stored? Is there an entry in the MySQL table, which refers to the stored video?
Thanks in advance for the answer!
According to this, YouTube still uses MySQL: http://code.google.com/p/vitess/wiki/ProjectGoals
I am not sure how things are at YouTube, but I am in the process of developing a similar application for our client. What we are doing is making use of the best of both worlds, i.e. SQL and NoSQL.
We store the videos on disk and store the path to these videos in MySQL db table. Then we have a separate table which holds the genre and video mapping i.e which video belongs to which particular genre.
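A minimal sketch of that two-table layout follows. SQLite stands in for MySQL here so the example is self-contained, and the table names, column names and paths are assumptions made for illustration.

```python
import sqlite3

# Hypothetical layout: one table for video paths, one mapping table
# that links each video to the genres it belongs to.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE videos (
        id   INTEGER PRIMARY KEY,
        path TEXT NOT NULL            -- where the file lives on disk
    );
    CREATE TABLE video_genres (
        video_id INTEGER NOT NULL REFERENCES videos(id),
        genre    TEXT NOT NULL,
        PRIMARY KEY (video_id, genre) -- a video can have several genres
    );
    """
)

# Made-up sample rows.
conn.executemany(
    "INSERT INTO videos (id, path) VALUES (?, ?)",
    [(1, "/data/videos/a.mp4"), (2, "/data/videos/b.mp4")],
)
conn.executemany(
    "INSERT INTO video_genres (video_id, genre) VALUES (?, ?)",
    [(1, "comedy"), (1, "music"), (2, "music")],
)
conn.commit()


def paths_for_genre(conn, genre):
    """Return the on-disk paths of all videos mapped to the given genre."""
    rows = conn.execute(
        "SELECT v.path FROM videos v "
        "JOIN video_genres g ON g.video_id = v.id "
        "WHERE g.genre = ? ORDER BY v.id",
        (genre,),
    )
    return [row[0] for row in rows]
```

Keeping the mapping in its own table lets one video belong to several genres without duplicating the path.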
Today, with a vast pool of user data, we are in a position to leverage that data like never before, so things are now very different than they were in 2007. Given how popular sites like YouTube are and how much people depend on them, we have a vast set of unstructured data which, if used properly, can give you great results. So in our project we store the site admin and reporting data (user db, video locations, genre mapping, etc.) in MySQL and store the unstructured data about user interaction in a NoSQL database. We then use the NoSQL data to do all the analytics and give appropriate results to the user.
They are using MySQL with Bigdata.
The user information, such as who uploaded the file, and the file information will all be stored in MySQL, and the data will be stored in Bigdata.
I think they are using a database that can use FileTable.

Organizing lots of file uploads

I'm running a website that handles multimedia uploads for one of its primary uses.
I'm wondering what the best practices or industry standards are for organizing a lot of user-uploaded files on a server.
Your question is exceptionally broad, but I'll assume you are talking about storage/organisation/hierarchy of the files (rather than platform/infrastructure).
A typical approach for organisation is to upload files to a 3 level hierarchical structure based on the filename itself.
Eg. Filename = "My_Video_12.mpg"
Which would then be stored in,
/M/y/_/My_Video_12.mpg
Or another example, "a9usfkj_0001.jpg"
/a/9/u/a9usfkj_0001.jpg
This way, you end up with a manageable structure that makes it easy to locate a file's location simply based on its name. It also ensures that directories do not grow to a huge scale and become incredibly slow to access.
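The scheme can be sketched as a small helper. The function name, the `root` parameter and the rule of rejecting names whose prefix contains a dot are assumptions added for the illustration; the sketch takes the first three characters exactly as they appear in the name, without changing case.

```python
import posixpath


def sharded_path(filename, root="/", levels=3):
    """Build a storage path with one directory level per leading character."""
    name = posixpath.basename(filename)
    prefix = name[:levels]
    # Assumption: short names and prefixes containing "." are rejected,
    # because "." and ".." are not usable as ordinary directory names.
    if len(prefix) < levels or "." in prefix:
        raise ValueError(f"cannot derive {levels} directory levels from {name!r}")
    # Each leading character becomes its own directory level.
    return posixpath.join(root, *prefix, name)


print(sharded_path("a9usfkj_0001.jpg"))  # /a/9/u/a9usfkj_0001.jpg
```

With file names chosen by users, the leading characters tend to cluster (many names start with "IMG" or "DSC"), so in practice it may be worth deriving the directory levels from a generated ID or a hash of the name instead.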
Just an idea, but it might be worth being more explicit as to what your question is actually about.
I don't think you are going to get any concrete answers unless you give more context and describe what the use cases are for the files. Like any other technology decision, the 'best practice' is always going to be a compromise between the different functional and non-functional requirements, and as such the question needs a lot more context to yield answers that you can go and act upon.
Having said that, here are some of the strategies I would consider sound options:
1) Use the conventions dictated by the consumer of the files.
For instance, if the files are going to be used by a CMS/publishing solution, that system probably has some standardized solution for handling files.
2) Use a third-party upload solution. There are a bunch of tools that can help guide you to a solution that solves your specific problem. Tools like Transloadit, Zencoder and Encoding all have different options for handling uploads. Having a look at those options should give you an idea of what could be considered "industry standard".
3) Look at proven solutions, and mimic the parts that fit your use case. There are open-source solutions that handle the sort of things you are describing here. Have a look at the different plugins for, say, paperclip, to learn how they organize files or, more importantly, what abstractions they provide that let you change your mind when the requirements change.
4) Design your own solution. Do a spike, it's one of the most efficient ways of exposing requirements you haven't thought about. Try integrating one of the tools mentioned above, and see how it goes. Software is soft, so no decision is final. Maybe the best solution is to just try something, and change it when it doesn't fit anymore.
This is probably not the concrete answer you were looking for, but like I mentioned in the beginning, design decisions are always a trade-off, "best-practice" in one context could be the worst solution in another context :)
Best of luck!
From what I understand, you want a suggestion on how to store the files. If that is what you want, I would suggest having 2 different storage systems for your files.
The first storage would be a place to store the physical file, like a directory on your server (w/o FTP enabled, accessible or not to browsers, ...) or Amazon S3 (aws.amazon.com/en/s3/), Rackspace CloudFiles (www.rackspace.com/cloud/cloud_hosting_products/files/) or any other storage solution (you can even choose Dropbox, if you want). All of these options offer APIs to save/retrieve the files.
The second storage would be a database, to index and control the files. In the DB, which could be MySQL, MSSQL or a non-relational database like Amazon DynamoDB or SimpleDB, you store the link to your file (an HTTP link, the path to the file or anything like that).
Also, in the DB you can control and store any metadata of the file you want, and choose one or more of @ebaxt's solutions to get it. The metadata can be older versions of the file, the words of a text file, the camera model and geo-location of a picture, etc. Of course it depends on your needs and how it will really be used. You have a very large number of options, but without more info on what you intend to do it is hard to suggest a solution.
In the Amazon tutorials area (http://aws.amazon.com/articles/Amazon-S3?browse=1) you can find many papers about it, like Netflix's Transition to High-Availability Storage Systems, Using the Java Persistence API with Amazon SimpleDB and Petboard: An ASP.NET Sample Using Amazon S3 and Amazon SimpleDB.
Regards.

How important is a database in managing information?

I have been hired to help write an application that manages certain information for the end user. It is intended to manage a few megabytes of information, but also manage scanned images in full resolution. Should this project use a database, and why or why not?
Any question "Should I use a certain tool?" comes down to asking exactly what you want to do. You should ask yourself - "Do I want to write my own storage for this data?"
Most web based applications are written against a database because most databases support many "free" features - you can have multiple webservers. You can use standard tools to edit, verify and backup your data. You can have a robust storage solution with transactions.
The database won't help you much in dealing with the image data itself, but anything that manages a bunch of images is going to have meta-data about the images that you'll be dealing with. Depending on the meta-data and what you want to do with it, a database can be quite helpful indeed with that.
And just because the database doesn't help you much with the image data, that doesn't mean you can't store the images in the database. You would store them in a BLOB column of a SQL database.
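As a rough illustration of the BLOB approach, the sketch below stores and reads back raw image bytes. It uses SQLite so it runs without a server; the table name, column names and the fake image bytes are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the image bytes live in a BLOB column next to the name.
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, name TEXT NOT NULL, data BLOB NOT NULL)"
)


def save_image(conn, name, data):
    """Insert raw image bytes and return the new row id."""
    cur = conn.execute("INSERT INTO images (name, data) VALUES (?, ?)", (name, data))
    conn.commit()
    return cur.lastrowid


def load_image(conn, image_id):
    """Return the stored bytes, or None if the id does not exist."""
    row = conn.execute("SELECT data FROM images WHERE id = ?", (image_id,)).fetchone()
    return None if row is None else bytes(row[0])


# Stand-in for a scanned image: a PNG signature followed by filler bytes.
fake_scan = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
image_id = save_image(conn, "scan-001.png", fake_scan)
```

In a real application the bytes would come from reading the scanned file in binary mode rather than from a literal.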
If the amount of data is small, or installed on many client machines, you might not want the overhead of a database.
Is it intended to be installed on many users' machines? Adding the overhead of ensuring you can run whatever database engine you choose on a client-installed app is not optimal. Since the amount of data is small, I think XML would be adequate here. You could Base64 encode the images and store them as CDATA.
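A minimal sketch of the XML idea, using only the Python standard library. The element names (`images`, `image`) and the `name` attribute are assumptions chosen for the example.

```python
import base64
from xml.dom import minidom


def images_to_xml(images):
    """Serialize a {name: bytes} mapping to XML with Base64 payloads in CDATA."""
    doc = minidom.Document()
    root = doc.createElement("images")
    doc.appendChild(root)
    for name, data in images.items():
        elem = doc.createElement("image")
        elem.setAttribute("name", name)
        encoded = base64.b64encode(data).decode("ascii")
        elem.appendChild(doc.createCDATASection(encoded))
        root.appendChild(elem)
    return doc.toxml()


def images_from_xml(xml_text):
    """Parse the XML produced above back into a {name: bytes} mapping."""
    doc = minidom.parseString(xml_text)
    result = {}
    for elem in doc.getElementsByTagName("image"):
        # Accept both CDATA sections and plain text nodes as payload.
        encoded = "".join(
            node.data
            for node in elem.childNodes
            if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
        )
        result[elem.getAttribute("name")] = base64.b64decode(encoded)
    return result
```

Base64 output only contains letters, digits, `+`, `/` and `=`, so the CDATA wrapper is not strictly required for correctness. Keep in mind that Base64 inflates the data by roughly a third, which matters for full-resolution scans.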
Will the application be run on a server? If you have concurrent users, then databases have concepts for handling these scenarios (transactions), and that can be helpful. And the scanned image data would be appropriate for a BLOB.
You shouldn't store images in the database, as is the general consensus here.
The file system is just much better at storing images than your database is.
You should use a database to store meta information about those images, such as a title, description, etc, and just store a URL or path to the images.
When it comes to storing images in a database, I try to avoid it. In your case, from what I can gather from your question, there is a possibility of a substantial number of fairly large images, so I would probably strongly oppose it.
If this is a web application I would use a database for quick searching and indexing of images using keywords and other parameters. Then have a column pointing to the location of the image in a filesystem if possible with some kind of folder structure to help further decrease the image load time.
If you need greater security due to the directory being available (network share) and the application is local then you should probably bite the bullet and store the images in the database.
My gut reaction is "why not?" A database is going to provide a framework for storing information, with all of the input/output/optimization functions provided in a documented format. You can go with a server-side solution, or a local database such as SQLite or the local version of SQL Server. Either way you have a robust, documented data management framework.
This post should give you most of the opinions you need about storing images in the database. Do you also mean 'should I use a database for the other information?' or are you just asking about the images?
A database is meant to manage large volumes of data, and is supposed to give you fast access to read and write that data in spite of the size. Put simply, databases manage scale for data, scale that you don't want to deal with yourself. If you have only a few users (hundreds?), you could just as easily manage the data on disk (say XML?) and keep the data in memory. The images should clearly not go into the database, so the question is how much data, or for how many users, you are maintaining this database instance.
If you want to have a structured way to store and retrieve information, a database is most definitely the way to go. It makes your application flexible and more powerful, and lets you focus on the actual application rather than incidentals like trying to write your own storage system.
For individual applications, SQLite is great. It fits right in an app as a file; no need for a whole RDBMS juggernaut.
There are a lot of factors to this. But, being a database weenie, I would err on the side of having a database. It just makes life easier when things change. And things will change.
Depending on the images, you might store them on the file system or actually blob them and put them in the database (not supported in all DBMSs). If the files are very small, then I would blob them. If they are big, then I would keep them on the file system and manage them yourself.
There are so many free or cheap DBMSs out there that there really is no excuse not to use one. I'm a SQL Server guy, but if your application is that simple, then the free version of MySQL should do the job. In fact, it has some pretty cool stuff in there.
Our CMS stores all of the check images we process. It uses a database for metadata and lets the file system handle the scanned images.
A simple database like SQLite sounds appropriate - it will let you store file metadata in a consistent, transactional way. Then store the path to each image in the database and let the file system do what it does best - manage files.
SQL Server 2008 has a new data type built for in-database files, but before that BLOB was the way to store files inside the database. On a small scale that would work too.
