Ideally, where would an application like Facebook store its "Friends" data?
In a database table? In an XML file?
From Facebook's engineering page:
"Already, we are the second most-trafficked PHP site in the world (Yahoo is #1), and one of the largest MySQL installations anywhere, running thousands of databases."
and
"We've built a lightweight but powerful multi-language RPC framework that allows us to seamlessly and easily tie together subsystems written in any language, running on any platform. Facebook is built in PHP, C++, Perl, Python, Erlang, Java, and even a little bit of ML—and it all works together.
* We are the largest user in the world of memcached, an open-source caching system. Originally developed by LiveJournal, we've since made so many scalability improvements and performance upgrades that we will be the primary contributor of features in the next major release.
* We've created a custom-built search engine serving millions of queries a day, completely distributed and entirely in-memory, with real-time updates."
Relational databases?
Check out this blog: http://highscalability.com/. It has many real-world examples of system architectures to learn from.
"Friends" data is well-described in a graph database. Neo4j is an example, though I know it's not the way Facebook stores this information.
Facebook uses a number of database technologies that may be involved:
a patched version of MySQL
Cassandra
Hadoop
... others
Most probably it uses some other mechanism. For example, a search engine does not keep its index in a database or an XML file. To get maximum performance, search engines generally keep some kind of tree (a binary search tree or something more complicated) and store it on disk in a performance-efficient manner. So I would guess Facebook uses a similar mechanism.
Certainly not in an XML file.
Yes, in a database, in one or several tables. And in the specific example of Facebook, on several servers.
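To make the "one or several tables" idea concrete, here is a minimal sketch of a friendship table using Python's built-in sqlite3 module. The table and column names (`users`, `friendships`, `user_id`, `friend_id`) and the helper functions are assumptions made up for illustration; this is not how Facebook actually stores the data. Each friendship is stored once, with the smaller user id first, so a mutual relationship needs only one row.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per friendship; the smaller id always goes in user_id.
    CREATE TABLE friendships (
        user_id   INTEGER NOT NULL REFERENCES users(id),
        friend_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (user_id, friend_id)
    );
    -- Second index so lookups by friend_id are also fast.
    CREATE INDEX idx_friendships_friend ON friendships(friend_id);
""")


def add_friendship(conn, a, b):
    """Store a mutual friendship once, regardless of argument order."""
    if a == b:
        raise ValueError("a user cannot befriend themselves")
    low, high = sorted((a, b))
    conn.execute(
        "INSERT OR IGNORE INTO friendships (user_id, friend_id) VALUES (?, ?)",
        (low, high),
    )


def friends_of(conn, user_id):
    """Return the sorted names of all friends of user_id."""
    rows = conn.execute(
        """
        SELECT u.name FROM friendships f
        JOIN users u ON u.id = f.friend_id
        WHERE f.user_id = ?
        UNION ALL
        SELECT u.name FROM friendships f
        JOIN users u ON u.id = f.user_id
        WHERE f.friend_id = ?
        """,
        (user_id, user_id),
    ).fetchall()
    return sorted(row[0] for row in rows)


conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)
add_friendship(conn, 2, 1)
add_friendship(conn, 1, 3)
add_friendship(conn, 1, 2)  # duplicate of the first call, ignored

print(friends_of(conn, 1))
print(friends_of(conn, 2))
```

On a single server this is enough. Spreading the table across several servers, as the answer above suggests Facebook does, would additionally need a rule for deciding which server holds which rows.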
I have worked with relational databases, but now I want to learn about graph databases. I came to know that these two are graph databases. What is the difference between them? Which one should we prefer?
One approach is to simply try to choose one database over the other. For example, you might quickly search around to find that Titan has been forked to JanusGraph, where it is more actively maintained. In your research you may find that there are other open source graph databases as well, like OrientDB, ChronoGraph, or Sqlg, as well as commercial alternatives like Microsoft's Cosmos DB, DSE Graph or IBM Graph. How do you decide now?
There is a graph framework that ties together all of these graphs including Neo4j/Titan (and more than those listed here): Apache TinkerPop. TinkerPop provides an abstraction over different graph databases and graph processors allowing the same code to be used with different configurable backends. This pattern is quite similar to the one you find in SQL with JDBC which helps make your code vendor agnostic.
You can try all of the different supported graph databases before you make a choice, and you can do this type of prototyping/benchmarking fairly quickly with the Gremlin Console. You will then be able to make an informed choice as to the best way to go for your project.
It occurs to me as I come to the end of this post that I haven't directly answered your question. If you are just getting started and are just interested in learning about graph databases, then I likely wouldn't recommend starting with Titan/JanusGraph as it requires a bit of configuration to get started (schemas, backend selection, etc). Start with TinkerGraph or Neo4j using the Gremlin Console to try out some simple graph traversals and go from there.
Titan was originally backed by Aurelius, which was bought by DataStax in 2015. This move was designed to give DataStax a jump-start into the Graph DB world, as they now offer their own "DSE Graph" enterprise product. Titan has since been forked (as previously mentioned) into JanusGraph.
The nice thing about Titan/Janus (IMO) is that it is "pluggable" with other existing back-end and search technologies. So it will "play nice" with things like Cassandra, HBase, Hadoop, Solr, and ElasticSearch.
The drawback is that the community support is tough. The Titan project has been effectively killed, and Janus scores a whopping 0.23 on DBEngines. That makes it the 16th most-popular Graph DB (231st overall), which is pretty low.
Neo4j is backed by Neo Technology, and is regarded as the front-runner in the Graph DB community (score of 38.52 right now, 1st graph DB and 21st overall). It is open source, but controlled by Neo Technology, so they can dictate a difference in feature set between open source and enterprise.
The nice thing about Neo4j is that they have a lot of tutorials and learning aids built right-in to the Neo4j Browser, which is a nice, user-friendly web interface. Their documentation is top-notch, easy to read and search through, and they have a pretty good following here on Stack Overflow.
The drawback of Neo4j is that some features (like clustering) are only available in the enterprise version. But if you work for a big company that doesn't mind shelling out $ for an enterprise license, that may not be a big deal.
Consistency: Titan/Janus is a part of the "eventual consistency" crowd, while Neo4j aims to be strong-consistent (especially in a causal clustering scenario). Although consistency can be tuned with configuration in both, with Titan/Janus that can be dependent on your choice of pluggable backend (ex: typically strong-consistent with HBase, while eventually consistent with Cassandra).
Recommendation:
If you're just starting to learn graph databases and modeling, you can't go wrong with Neo4j. Simply download/install the community edition, run it, and execute :play movies as your first command (tutorial that walks you through loading, modeling, and querying movie relationships).
If you have some experience with graph, and you don't mind troubleshooting/googling to figure out things (like how to set the max frame size for Thrift), then you could probably do some really cool things with Titan.
Try each out, and see which one works for you.
There are far more than two graph databases - there are dozens. That being said, there are two with real market share: Neo4j and Titan/JanusGraph. The other graph databases each have interesting strengths for specific application spaces, but I wouldn't dig into all of the niche players to start with - learning the basic idea of graph databases can be done with one of the two lead players.
Neo4j is the most mature, with the most nicely packaged install and documentation, tons of reference code, and support from a wide range of partners.
Titan/JanusGraph is the next most popular, as it's free/open source and has very strong support (e.g. IBM, Google, Hortonworks, AWS, ...). There's a recent complexity in that the leaders of the Titan project were acquired, freezing the Titan project. But the community forked the project into JanusGraph. So while JanusGraph is a new project, it's literally the same Titan code, with even broader industry support than Titan had.
Related to the two is the language used to work with the graphs. Neo4j uses its proprietary language, Cypher, while nearly everyone else uses Gremlin, and the TinkerPop open source tool set (which is a part of the Apache set of open source projects). Nearly all graph databases, including Neo4j, support Gremlin and TinkerPop. So, for example, you can use either Cypher or Gremlin to query Neo4j, though Neo (and some other proprietary graph database vendors) support Gremlin as a second-class citizen, so to speak. For example, you can connect to Neo using Gremlin from the (external) Gremlin console, but you can't use Gremlin in the (very nice) Neo4j console.
Note that there are many graph databases that support Gremlin other than Titan/JanusGraph. One new entrant that's very interesting is Microsoft's Azure Cosmos DB, which is a managed graph database that's "cheap and easy" if you use Azure already. And there are several vendors that provide managed JanusGraph.
For personal learning, I'd say that Neo4j is the easiest to set up and learn - you download and run it, and open a web browser onto their web-based console, which only takes a few minutes. That being said, if you're comfortable on a command line, JanusGraph only took a half hour to install and get running for me, so it's not too hard.
For learning the concepts Neo4j is great. Neo4j's query language, Cypher, and JanusGraph's query language, Gremlin, are semantically identical, just spelled differently, so you'll learn the concepts either way.
For building a real system, either could work (and there are many successful systems following both approaches).
As for which you choose, you'll want to think about whether you want to be strategically tied to a single vendor (Neo4j) or be part of a broader standards-based community. There's a comfort level in picking the market leader with the most mature product - Neo4j. And there's a comfort level in picking open standards with strong industry support - JanusGraph. So IMO there's no "wrong" answer - people using either one are happy and successful. But since you have to pick, you'll need to think about which you're more comfortable with long-term.
Neo4j uses native graph technology.
Native graph storage writes nodes and relationships close to each other on disk, so the data is stored efficiently and the database is optimized for graph workloads.
Native graph processing is faster because it uses index-free adjacency: each node directly references its adjacent nodes.
Titan (now JanusGraph) uses non-native graph technology.
A non-native graph database sits on top of a separate storage backend such as Cassandra or HBase.
Non-native processing is slower than native processing because the database has to use several kinds of indexes to link nodes together.
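The difference between the two approaches can be sketched in a few lines of Python. This is only a toy model, not how Neo4j or JanusGraph are implemented: the `Node` and `IndexedGraph` classes and all function names are assumptions made up for the example. In the first version each node holds direct references to its neighbours (index-free adjacency); in the second, every hop goes through a lookup in a separate index keyed by node name.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Index-free adjacency: a node points straight at its neighbours."""
    name: str
    neighbors: list[Node] = field(default_factory=list)


def connect(a: Node, b: Node) -> None:
    a.neighbors.append(b)
    b.neighbors.append(a)


def friends_of_friends_native(start: Node) -> set[str]:
    """Two-hop traversal that only follows object references."""
    direct = {n.name for n in start.neighbors}
    result = set()
    for friend in start.neighbors:
        for fof in friend.neighbors:
            if fof is not start and fof.name not in direct:
                result.add(fof.name)
    return result


class IndexedGraph:
    """Non-native style: edges live in an index that must be consulted per hop."""

    def __init__(self) -> None:
        self.edge_index: dict[str, set[str]] = {}

    def connect(self, a: str, b: str) -> None:
        self.edge_index.setdefault(a, set()).add(b)
        self.edge_index.setdefault(b, set()).add(a)

    def friends_of_friends(self, start: str) -> set[str]:
        direct = self.edge_index.get(start, set())  # index lookup, hop 1
        result = set()
        for friend in direct:
            for fof in self.edge_index.get(friend, set()):  # lookup, hop 2
                if fof != start and fof not in direct:
                    result.add(fof)
        return result


# Same small graph built both ways.
alice, bob, carol, dave, erin = (Node(n) for n in ("alice", "bob", "carol", "dave", "erin"))
connect(alice, bob)
connect(bob, carol)
connect(bob, dave)
connect(alice, erin)

indexed = IndexedGraph()
for a, b in [("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("alice", "erin")]:
    indexed.connect(a, b)

print(sorted(friends_of_friends_native(alice)))
print(sorted(indexed.friends_of_friends("alice")))
```

Both versions return the same answer. In a real system the index is a B-tree or similar structure in a separate storage backend rather than an in-memory dict, so the cost of each lookup is what the answer above refers to.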
I was just wondering how many DB queries Facebook might be issuing to render a user's home page. Does anybody have some idea of how the Facebook DB is designed? I've heard it runs MySQL and that there are thousands of replicas, plus more memcache servers than DB servers.
Is the Facebook data sharded?
If it is, does it go to every shard and search for the latest update of each of my friends? In the worst case, if I have 100 friends and Facebook has 101 shards, there is a possibility that each of my friends is in a different shard. How might Facebook be handling this?
I'll be highly grateful if somebody can provide me some hints or pointers towards something like "How to Design a DB for a Social Networking Website". I'm just curious!
Facebook uses a LAMP stack. Facebook’s backend services are written in a variety of programming languages, including C++, Java, Python, and Erlang, chosen according to requirements. Alongside LAMP, Facebook uses some additional technologies to support the large number of requests, such as:
Memcache - It is a memory caching system that is used to speed up dynamic database-driven websites (like Facebook) by caching data and objects in RAM to reduce reading time. Memcache is Facebook’s primary form of caching and helps alleviate the database load. Having a caching system allows Facebook to be as fast as it is at recalling your data.
Thrift (protocol) - It is a lightweight remote procedure call framework for scalable cross-language services development. Thrift supports C++, PHP, Python, Perl, Java, Ruby, Erlang, and others.
Cassandra (database) - It is a database management system designed to handle large amounts of data spread out across many servers.
HipHop for PHP - It is a source code transformer for PHP script code and was created to save server resources. HipHop transforms PHP source code into optimized C++. After doing this, it uses g++ to compile it to machine code.
Going into more detail would make this answer much longer. You can learn more from the following posts:
How Does Facebook Work?
Data Management, Facebook-style
Facebook database design?
Facebook wall's database structure
Facebook "like" data structure
At this website you find lots of details about those big internet companies and their technical structures:
http://highscalability.com/
Adding to Somnath Muluk's answer: Facebook uses a few other technologies as well, such as Hadoop.
Refer to the following links for more details:
http://www.quora.com/Facebook-Engineering/What-is-Facebooks-architecture
Facebook Architecture
Hope it helps.
Zero. On average, that is. A highly interconnected network with a large number of users, such as Facebook, can only run effectively if it runs fully out of RAM for the pages that are shown often. Nearly all data should already be in memcache.
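A rough way to see why the average can approach zero is the cache-aside pattern: look in the cache first and only query the database on a miss. The sketch below uses a plain Python dict as a stand-in for memcache and a query-counting class as a stand-in for MySQL; `CountingDatabase`, `CachedProfiles`, and their methods are hypothetical names for illustration and say nothing about Facebook's real code.

```python
class CountingDatabase:
    """Stand-in for the real database; counts how often it is queried."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.queries = 0

    def fetch_profile(self, user_id):
        self.queries += 1
        return self.rows[user_id]


class CachedProfiles:
    """Cache-aside reads: check the cache, fall back to the database on a miss."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # stand-in for memcache

    def get(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]  # hit: no database query at all
        profile = self.db.fetch_profile(user_id)  # miss: one query
        self.cache[user_id] = profile
        return profile


db = CountingDatabase({1: {"name": "alice"}, 2: {"name": "bob"}})
profiles = CachedProfiles(db)

page_views = 1000
for _ in range(page_views):
    profiles.get(1)

print(db.queries, "database query for", page_views, "page views")
print("average queries per view:", db.queries / page_views)
```

The sketch ignores invalidation and expiry: a real deployment also has to update or evict cached entries when the underlying row changes.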
I have received a message about the CUBRID database saying that it performs better than MySQL. Has anyone heard about it?
Is that correct?
Regards
I use CUBRID in most of my projects. Whether it is "better than MySQL", I think, depends on the situation and on the needs of your application. For some applications CUBRID is really nice, for others MySQL or some other database is. For example, CUBRID has very nice features optimized for social networking services, where you often have heavy traffic on one page, use lots of indexes, and take advantage of covering indexes. They provide some nice examples of how to design your database schema and how to tune queries to obtain the best performance (link).
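For readers unfamiliar with the term, a covering index is one that contains every column a query needs, so the database can answer the query from the index alone without touching the table rows. The sketch below demonstrates the idea with Python's built-in sqlite3 module as a stand-in, since the concept is not specific to CUBRID; the `posts` table, its columns, the index name, and the `query_plan` helper are assumptions invented for this example, and CUBRID's own syntax and plan output will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        created_at TEXT    NOT NULL,
        body       TEXT    NOT NULL
    );
    -- Index holding both columns used by the timeline query below.
    CREATE INDEX idx_posts_user_created ON posts(user_id, created_at);
""")
conn.executemany(
    "INSERT INTO posts (user_id, created_at, body) VALUES (?, ?, ?)",
    [(1, "2024-01-01", "hello"), (1, "2024-01-02", "world"), (2, "2024-01-03", "hi")],
)


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Needs only user_id and created_at, both of which are in the index.
covered_plan = query_plan(
    conn,
    "SELECT created_at FROM posts WHERE user_id = ? "
    "ORDER BY created_at DESC LIMIT 10",
    (1,),
)

# Needs body as well, so the table rows must be read too.
uncovered_plan = query_plan(
    conn, "SELECT body FROM posts WHERE user_id = ?", (1,)
)

print(covered_plan)
print(uncovered_plan)
```

The first plan reports that a covering index is used; the second uses the same index only to locate rows and then reads each row for `body`.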
What's your case? If you expect several hundred simultaneous users who generate some thousands of new records every day, CUBRID can easily handle all of this. This is what database systems are created for.
You should also consider the environment you are developing in. Is your app developed in PHP, Python, or something else? We use PHP and Java on our sites. CUBRID has many drivers. I believe you can find the necessary driver on their site.
You should also look at the community support. If you have some questions or issues with their database, it's often faster to directly write on their Q&A site or forum.
For a new application based on Erlang and Python, we are thinking of trying out a non-RDBMS database (just for the sake of it). Some of the databases I've researched are MongoDB, CouchDB, Cassandra, Redis, Riak, and Scalaris. Here is a list of simple requirements.
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Erlang and Python.
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment performance(in any form) isn't exactly on top of my list, and I don't think we'll be hitting the limitations of any of the above mentioned databases anytime soon.
I'm just looking for a starting point for non-RDBMS database. Any recommendations?
We have used Mnesia in building an enterprise application. Mnesia performs at its best when its tables are fragmented, because then it has no table size limits. Mnesia has performed well for the last year and is still going. We have around 15 million records per table on average and around 24 tables in a given database schema.
I recommend the Mnesia database, especially the one that ships with Erlang 14B03 from the Erlang.org website. We have used CouchDB and Membase Server (http://www.couchbase.com) for some parts of the system, but Mnesia is the main (primary) data storage. Backups have been automated very well, and the system scales well with increasing data size, even with tables running under many checkpoints. Its distribution, auto-replication and complex data model enabled us to build the application very quickly without worrying about replication, scalability and fail-over / take-over of systems.
Mnesia scales well, and its schema can be configured and changed while the database is running. Tables can be moved, copied, altered, etc. while the system is live. Generally, it has all the features of powerful systems built on top of Erlang/OTP. When you google "mnesia DBMS", you will find a number of books and papers that will tell you more.
Most importantly, our application is web based, powered by the Yaws web server (yaws.hyber.org), and we are impressed with Mnesia's performance. Its record lookup speeds are very good, and the system feels light even though it renders a lot of data. Do give Mnesia a try and you will not regret it.
EDIT: To quickly use it in your application, look at the answer given here
Ease of development - I need to make a quick proof-of-concept demo. So the database needs to have good adapters for Erlang and Python.
Riak is written in Erlang => speaks Erlang natively
I'm working on a new application where we have lots of "connected" data. Somebody recommended Neo4j for graph-like data. Any ideas on that?
Neo4j is great for "connected" data. It has Python bindings and some Erlang adapters (see "How to Use Neo4j From Erlang"). One thing to note: Neo4j is not as easy to scale out, at least for free. But it is fully transactional (even JTA), it persists things to disk, and it is baked into Spring Data.
Scalable - We are looking at a distributed architecture, hence scalability is important.
For the moment performance(in any form) isn't exactly on top of my list, and I don't think we'll be hitting the limitations of any of the above mentioned databases anytime soon.
I believe given your input, Riak would be the best choice for you:
Written in Erlang
Naturally Distributed
Very easy to develop for/with
Lots of features (secondary indices, virtual nodes, fully modular, pluggable persistence [LevelDB, Bitcask, InnoDB, flat file, etc.], extremely reliable, built-in full-text search, etc.)
Has an extremely passionate and helpful community with Basho backing it up
With the rise of non-SQL database usage on high-traffic websites, I'm interested in using one for my project. I've heard several names, like Voldemort, MongoDB and CouchDB. But which of these non-SQL databases are production ready? I've seen the download pages, and it seems that none of them is production ready because none has reached version 1.0 yet. Are there any names other than these three that can be recommended for production use?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list as well as the mnesia database provided by the Erlang VM.
I think you need to start from your project requirements to see what kind of database you really need. There are many non-relational DBMSs out there, and they differ a lot in what kinds of problems they are good at solving. I think the article "Should you go Beyond Relational Databases?" by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn’t you use a relational database?
Good reasons NOT to use a relational database?
When you have narrowed down what you actually need you can take a deeper look into the alternatives to see which DBMS are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy some solution that for example lacks in tool support - in another project this could be a no-go.
As for version numbers, different projects have a different take on this, so you can't just compare them. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a version 1.0 final.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
Nice article comparing rdbms with 'next gen' and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I would suggest using ArangoDB.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32bit and 64bit systems, though using a 32bit system will limit you to using only approximately 2 to 3 GB of data with ArangoDB.