How does Google store and organize documents, such as documents in Google Docs?
I'd like to ask which file system Google uses, but I know it uses GFS, a distributed file system for storing huge files; basically, it stores the huge databases that contain, among other things, the documents I am interested in.
My question is: Is each document a record in a DB? And how does it identify documents in a hierarchical system, such as web pages? How does it relate them, or represent the hierarchical structure, if needed?
It looks like Google created its own "file system" inside a database (besides the underlying GFS).
Does anyone know of a specification, or how it works?
For Google Drive, yes, databases are involved, but unfortunately I can't really share any of the confidential details, sorry.
p.s. check out the GFS paper: http://research.google.com/archive/gfs.html
I have a project I would like to work on which I feel is a beautiful case for Neo4j. But there are aspects about implementing this that I do not understand enough to succinctly list my questions. So instead, I'll let the scenario speak for itself:
Scenario: Put simply, I want to build an application for users who receive files of various types, such as PDFs, Excel and Word documents, images, audio clips and even videos (although not so many videos), and allow them to upload and categorize these files.
With each file they will enter in any and all associations. Examples:
If Joe authors a PDF, Joe is associated with the PDF.
If a DOC says that Sally is Mary's mother, Sally is associated with Mary.
If Bill sent an email to Jane, Bill is associated with Jane (and the email).
If company X sends an invoice (Excel grid) to company Y, X is associated with Y.
and so on...
So the basic goal at this point would be to:
Have users load in files as they receive them.
Enter the associations that each file contains.
Review associations holistically, in order to predict or take some action.
Generate a report of the associations of interest, including the files that the associations are based on.
The value of this project is in the associations, which in reality would grow much more complex than the above examples and should produce interesting conclusions. However, if the user is asked "How did you come to that conclusion?", they need to be able to produce a summary of the associations as well as any files that those associations are based on, i.e. the PDF or the Excel file or whatever.
Initial thoughts...
I should also add that this application would be hosted internally and probably used by approximately 50 users, so I probably don't need the fastest, most scalable, highest-availability solution possible. The data being loaded could get rather large though, maybe up to a terabyte in a year? (Not the associations, but the actual files.)
Wouldn't it be great if Neo4j just did all of this! Obviously it should handle the graph aspects very nicely, but I figure that the file storage and text search are going to need another player added to the mix.
Some combinations of solutions I know of would be:
Store EVERYTHING including files as binary in Neo4J.
Would be wrestling Neo4j for something it's not built for.
How would I search text?
Store only associations and metadata in Neo4j and uploaded files on the file system.
How would I do text searches on files that are stored on the file server?
Store only associations and metadata in Neo4j and uploaded files in Postgres.
Not so confident about having all my files inside a DB. I feel more comfortable having all my files accessible in folders.
Everyone says it's great to put your files in a DB. Everyone says it's not great to put your files in a DB.
Get to the bloody questions..
Can anyone suggest a good "stack" that would suit the above?
Please give a basic outline on how you would implement your suggestion, ie:
Have the application store the data in Neo4j, then use triggers to update Postgres.
Or have the files loaded into Postgres and use triggers to update Neo4j.
Or have the application load data into Neo4j and then have the application load the data into Postgres.
etc
How you would tie these together is probably what I am really trying to grasp.
Thank you very much for any input on this.
Cheers.
p.s. What a ramble! If you feel the need to edit my question or title to simplify, go for it! :)
Here are my recommendations:
Never store binary files in the database. Store them in the filesystem or a service like AWS S3 instead, and reference the file in your data model.
I would store the file first in S3 and a reference to it in your primary database (Neo4j?).
If you want to be able to search for any word in a document, I would recommend using a full-text search engine like Elasticsearch. Elasticsearch can scan multiple document formats, like PDF, using Tika.
You can probably also use Elastic/Tika to search for relationships in the document and surface them in order to update your graph.
Suggested Stack:
Neo4j
Elasticsearch
AWS S3 or some other redundant filesystem to avoid data loss
Bonus: See this SO question/answer for best practices on indexing files in multiple formats using ES.
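For what it's worth, here is a rough sketch in Python of how those pieces could hang together. The bucket name, credentials, index name and the ingest flow are all made up for illustration, and text extraction (e.g. via Tika) is assumed to happen elsewhere:

    import boto3
    from neo4j import GraphDatabase
    from elasticsearch import Elasticsearch

    BUCKET = "file-associations-example"   # hypothetical bucket name
    s3 = boto3.client("s3")
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    es = Elasticsearch("http://localhost:9200")

    def ingest(path, key, extracted_text, author):
        # 1. Put the raw file in S3 (the redundant file store).
        s3.upload_file(path, BUCKET, key)

        # 2. Record the file and one association (its author) in Neo4j.
        with driver.session() as session:
            session.run(
                "MERGE (p:Person {name: $author}) "
                "MERGE (f:File {s3_key: $key}) "
                "MERGE (p)-[:AUTHORED]->(f)",
                author=author, key=key)

        # 3. Index the extracted text in Elasticsearch for full-text search.
        es.index(index="files", document={"s3_key": key, "text": extracted_text})

    ingest("invoice.pdf", "files/invoice.pdf", "text pulled out of the PDF...", "Joe")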
I was actually reading up on CMS systems like Drupal, and noticed that they're similar to RDBMSes. I was wondering what the differences might be between them. When would we use an RDBMS and when would we use a CMS? Kinda confused and appreciate any input on this. Thanks!
You are asking what is the difference between a car and a Diesel engine.
CMS (Content management system) is a full-blown system allowing end-users to view and modify content (articles, media, etc.) in an easy way. See: List of content management systems.
RDBMS (Relational database management system) is a back-end software used to store low-level data. Typically CMSes use RDBMSes (like MySQL), but this is not a requirement. Nowadays end-users seldom use RDBMS directly.
After researching this topic (and being a Senior Oracle DBA and Java Developer) and reaching a good understanding of both, I can answer this.
Think of a CMS like Amazon: they host digital content such as books, movies and a myriad of other merchandise. Amazon has a fairly open-ended query capability that helps you find products quickly and accurately (most times, anyway!).
Now imagine this: Amazon goes in and organizes content a little differently for logic/efficiency's sake. So what happens to the applications that use it? Typically nothing!
You have to create a schema in an RDBMS and write code to interact with that schema. A CMS has facilities to store, organize and retrieve content (although some software would have to be developed that uses/interfaces with the "system"). Arguably, one could create database tables that have BLOB/CLOB types and make something CMS-like.
One fine difference is that a CMS can support MIME types such as PDF and .DOCX files and understand how to search those for content as well (unlike a BLOB, which is just a BLOB, right?).
I wouldn't limit a CMS by saying it is a "document repository" for web-based applications, because it is much more than that; a CMS can store structured, semi-structured and completely unstructured data (executables/.bin files/imagery... whatever!)
They're not similar at all. A CMS is a Content Management System; it's used for maintaining content in a website. In a CMS you generally think of content as "pages" or "documents".
An RDBMS is a Relational Database Management System. An RDBMS manages data, in the form of text, numbers, etc., in a highly relational format. In an RDBMS, you think of data as "numeric values", strings of "text", "datetimes", or other primitive data formats.
Usually a CMS uses an RDBMS under the hood to store the data, which the CMS displays as page content.
They are really not the same:
RDBMS : Relational Database Management System - for handling data (SQL, MYSQL)
http://en.wikipedia.org/wiki/Relational_database_management_system
CMS: Content Management System - For working with web site content
http://en.wikipedia.org/wiki/Content_management_system
I think you are missing the point here... As far as I know, a CMS uses an RDBMS to provide its data; it is just an abstraction layer built upon a database. It's much easier for a user to edit content through the CMS than by directly editing data in a database table.
A CMS, strictly speaking, is a Content Management System, typically used for serving up web pages (SharePoint and other document repositories notwithstanding). An RDBMS is a Relational Database Management System. CMSs usually reside within an RDBMS. RDBMSs are also typically used for much more intensive purposes than a CMS, for example, storing customer information or product data.
So one can integrate code, e.g. Java..., in a CMS to interact with an RDBMS, e.g. MySQL, to end up with a CRUD (Create, Read, Update, Delete) web-type application?
The idea behind this is to allow any user of the website to update (specifically allowed) data stored in the RDBMS.
We seem to be getting close to an eCommerce website ...
CMS+RDBMS+code = eCommerce website
Not sure if that exists, though?
I'm new to Stack Overflow, so hi. I'm also pretty new to the programming scene, so I'll try to be as technically adept as possible.
Basically, I'm designing an application that stores files and schemas on the server side.
The simple architecture is that there will be two databases: one for file storage and the other to hold XML files that point to the files.
I'm very new to this and after looking around at some different options (fedora commons, php nuke, remository) I'm feeling a bit lost and just looking for some advice or good paths on the topic.
For general information: the files being stored in the repository/database will be quite small, but there will be a lot of them. The XML schema will point to the location of these files.
E.g., a file's location is /images/stackOverflow/mine.jpg
And the XML would look like:
    <images>
      <location>/images/stackOverflow/mine.jpg</location>
    </images>
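Just to make that concrete, resolving such an XML record back to the file on disk might look something like this (only a sketch using Python's standard library; the path is the example one above):

    import xml.etree.ElementTree as ET

    xml = "<images><location>/images/stackOverflow/mine.jpg</location></images>"
    location = ET.fromstring(xml).findtext("location")
    with open(location, "rb") as f:   # the actual file lives on the filesystem
        data = f.read()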
The architecture is pretty flexible and there's no content management system in place, so any advice is appreciated.
I did a search on Stack Overflow, but most questions were about client-side storage instead of server-side.
For storing documents you could look into a Document-Oriented Database or Document Management System.
Document-Oriented Database
You can also look at other NoSQL databases that are XML-based, which may make more sense than a typical RDBMS.
XML Database
Yet another option is utilizing an RDBMS for document storage, which allows you to organize documents by record and perform industry-standard SQL queries to search for and retrieve documents.
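As a minimal sketch of that RDBMS option (using SQLite only because it needs no setup; the table layout is invented for illustration):

    import sqlite3

    conn = sqlite3.connect("repository.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS documents (
                        id   INTEGER PRIMARY KEY,
                        path TEXT,
                        xml  TEXT)""")

    # Store the XML metadata as a record pointing at the file on disk.
    xml = "<images><location>/images/stackOverflow/mine.jpg</location></images>"
    conn.execute("INSERT INTO documents (path, xml) VALUES (?, ?)",
                 ("/images/stackOverflow/mine.jpg", xml))
    conn.commit()

    # Industry-standard SQL to search for and retrieve documents.
    rows = conn.execute("SELECT path FROM documents WHERE path LIKE ?",
                        ("%stackOverflow%",)).fetchall()
    print(rows)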
We've been discussing the design of a data warehouse strategy within our group for meeting testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want, you all can help.
Most of our files are large, 50+ GB in size, and held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value-pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree type relationship with the Big files. One file's content will point to the next and so on, and so on. It's not a "spoked wheel," it's a full blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken out into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table) and you get Mongo's replication features to boot.
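A minimal sketch of that look-up pattern with pymongo/gridfs (the database name, metadata fields and offsets are made up, and a local MongoDB instance is assumed):

    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017")["warehouse"]
    fs = gridfs.GridFS(db)

    # Store a large file along with the metadata you want to look it up by.
    with open("capture.bin", "rb") as f:
        fs.put(f, filename="capture.bin",
               metadata={"name": "run42", "date": "2012-06-01",
                         "source": "sensor-A", "artifact": "raw"})

    # Later: find it by metadata and read only a slice of it.
    grid_out = fs.find_one({"metadata.source": "sensor-A", "metadata.name": "run42"})
    grid_out.seek(1024 * 1024)          # skip the first megabyte
    chunk = grid_out.read(64 * 1024)    # read 64 KB from that offset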
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?
I just read tons of material on Amazon's S3 and CouchDB. Maybe not enough yet though, so here is my question:
Both systems sound very appealing to me. CouchDB is distributed under the Apache License v2, and with Amazon's S3, you pay per stored megabyte and for the traffic you generate. So there is a bit of a difference monetarily.
But from a technical point of view, from what I understood, both systems help you store unstructured data of arbitrary size (depending on the underlying OS, as I understand it, in CouchDB's case).
I don't know how easy it would be to come up with a unified interface for both of them, so that you could just change your "datastore provider" as the need arises, without having to change any of your code.
I also don't know if this is technically easily feasible, haven't looked at their protocols yet in great detail. But it would be great to postpone the provider decision to as late as possible.
Also this could be interesting for integration testing purposes: You could for example test against a local CouchDB instance and run your code against S3 for production use.
To formulate my question from a different angle: are Amazon's S3 and CouchDB essentially solving the exact same problem, or is this insane and have I missed the whole point?
Updated Question
After Jim's brilliant answer, let me then rephrase the question to:
"Common Interface for CouchDB and Amazon SimpleDB"
And following the same lines of thinking, do you see a problem with a common interface between CouchDB and SimpleDB then?
You're missing the point, just slightly. CouchDB is a database. S3 is a filesystem. They're both relatively unstructured, but with S3 you're storing files under keys while with CouchDB you're storing (arbitrarily-structured) data under keys.
The Amazon Web Services analogue to something like CouchDB would be Amazon SimpleDB.
Something like what you're looking for already exists for Ruby, and it's called Moneta. It even can store stuff on S3, which may be exactly what you want.
You are wrong, Jim. S3 is not a filesystem; it is a web service for a key-value store.
Amazon provides you with a key. Yes, the value of that key can be data that represents a file. But, how that gets managed in the Amazon system is something entirely different. It can be stored in one node, multiple nodes, geographically strategic nodes with cloudfront, and so on. There is nothing in that key in and of itself that indicates how the system will manage the file. The value of the key is never a file directly. It is data that represents the file. How that value gets eventually resolved into a file that the client receives is entirely separate.
The value of that key can actually be data that does not represent a file. It can be a JSON dictionary. In that sense, S3 could be used in the same way as CouchDB.
So I don't think the question is missing the point. In fact, it is a perfectly legitimate question as data in CouchDB is not distributed amongst nodes. And that could hamper performance.
Let's not even talk about Amazon SimpleDB. That is something separate. Please don't mix terms and then make claims based on that.
If you are not convinced by this claim, and if people request it, I am happy to provide a code bit that illustrates a JSON dictionary in S3.
I respect your answers to other questions, Jim. But here you are clearly wrong, and I cannot see how those points are justified.
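Along those lines, a small sketch of storing a JSON dictionary under an S3 key (assuming boto3 and a hypothetical bucket name):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-example-bucket"   # hypothetical bucket name

    # Store a JSON dictionary under a key, much as a document store would.
    doc = {"type": "person", "name": "Sally", "mother_of": "Mary"}
    s3.put_object(Bucket=BUCKET, Key="docs/sally",
                  Body=json.dumps(doc), ContentType="application/json")

    # Fetch it back and parse it.
    obj = s3.get_object(Bucket=BUCKET, Key="docs/sally")
    data = json.loads(obj["Body"].read())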
Technically, a common layer is possible. However, I question whether this would make sense.
CouchDB has integrated map/reduce functions for your documents, which are exposed as "views". I don't think SimpleDB has anything like that. On the other hand, SimpleDB has query expressions, which CouchDB does not. Of course, you can model those expressions as a view in CouchDB if you know your query at development time.
Beyond that, the common functionality is not much more than creating/updating/deleting a key-document pair.
This is an old question, but it comes up first in my searches for this.
PouchDB is a fully compliant CouchDB interface and uses LevelDOWN for its backend. It is capable of using any LevelDOWN-compliant interface.
s3leveldown is a LevelDOWN-compliant interface that uses S3 as its backing store.
https://github.com/loune/s3leveldown
In theory, you could put them together to create a full CouchDB style implementation, backed by S3.
https://github.com/loune/s3leveldown/tree/master/examples/pouchdb