Just came across the FlockDB graph database. Details are at github/flockdb. Twitter claims it uses FlockDB for the following:
Twitter runs FlockDB on a large cluster of machines.
we use it to store social graphs (who
follows whom, who blocks whom) and
secondary indices at twitter.
At first glance, setting it up and trying it out doesn't look straightforward. Has anyone already used it or set it up? If so, please answer the following general questions.
What kind of applications is it better suited for? (Twitter claims it is simple and very rough; it remains to be seen what that means, though.)
How is FlockDB better than other graph or NoSQL databases?
Have you set up FlockDB and used it for an application?
Any early advice?
Note: I am evaluating FlockDB and other graph databases mainly to learn them. Perhaps I will build an application along the way.
FlockDB is still yet to be fully released by Twitter, which means the current version you are seeing won't run properly. Going by the commit history, I guess within a couple of days you will see a stable version which you can build and test.
Compared to something like Neo4j, you could say FlockDB is not even a graph database. The toughest part of a graph database is how many levels of depth it can handle. From the little documentation FlockDB has, it seems it can't handle more than one level of depth. Where FlockDB wins compared to DBs like Neo4j is its low latency, high throughput, and inherently distributed nature.
Regarding applications: I guess it will be a great fit whenever you need social-networking or Twitter-like behavior. I don't think many will find such use cases, though (who gets 20k friend requests per second?).
I just started looking into FlockDB. Right now I am planning to use it in my forum software. Instead of a "user1 follows user2" relationship, I am planning to use it for "user1 read post1", "user1 favorites post1", and so on. Being one of the more highly active online communities, we get a lot of such traffic (reads/favorites). I can't think of any other use cases right now.
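To make that concrete, here is a toy sketch of the edge model I have in mind (typed, directed edges such as "read" and "favorite"). This is just an in-memory illustration in Python, not FlockDB's actual client API, which I haven't tried yet:

    # Toy in-memory picture of typed, directed edges -- NOT FlockDB's client API.
    from collections import defaultdict

    class EdgeStore:
        def __init__(self):
            # (edge_type, source_id) -> set of destination ids
            self._forward = defaultdict(set)

        def add(self, edge_type, source_id, dest_id):
            self._forward[(edge_type, source_id)].add(dest_id)

        def neighbors(self, edge_type, source_id):
            return self._forward[(edge_type, source_id)]

    store = EdgeStore()
    store.add("read", "user1", "post1")
    store.add("favorite", "user1", "post1")
    print(store.neighbors("read", "user1"))   # {'post1'}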
Don't miss OrientDB. It's a document-graph DBMS with a special operator for traversing relationships: http://code.google.com/p/orient/wiki/GraphDatabase
I'm going to create a fairly large (from my point of view, anyway) web project with a friend. We will create a site with roads and other road-related info.
Our calculation is that we will have around 100k items in our database. Each item will contain some information like location, name, etc. (about 30 things each). We are counting on having a few hundred thousand unique visitors per month.
The 100k items and their locations (that will be searchable) will be the main part of the page but we will also have some articles, comments, news and later on some more social functions (accounts, forums, picture uploads etc.).
We were going to use Google App Engine to develop our project, since it is really scalable and free (at least for a while). But I'm actually starting to doubt that App Engine is right for us. It seems to be aimed at web apps rather than sites like ours.
Which system (language/framework, etc.) would you recommend we use? It doesn't really matter whether we already know the language (we like learning new stuff), but it would be good if it's something that is future-proof.
I think that GAE can do the job. Google claims that Google App Engine is able to handle 5 million visitors for free and you will have to start paying only if you exceed their free quota.
It's also pretty easy to get started. If you don't have experience administering websites and choose a regular hosting service, you will have to worry about several things that you don't even imagine now.
My only concern would be with respect to the kind of data and queries you will have to run, since it does not have a relational database. Anyway, there is an open-source project for GAE called GeoModel that gives GAE the ability to do complex geospatial queries, like proximity fetches. Have a look at their tutorial and the demo app.
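For example, here is roughly how GeoModel plugs into the datastore, going by its tutorial (I haven't run this myself, and the model/field names are made up, so check the current docs for exact signatures):

    from google.appengine.ext import db
    from geomodel import GeoModel   # assumes the geomodel package is on your path

    class RoadItem(GeoModel):       # GeoModel extends db.Model with a 'location' GeoPt
        name = db.StringProperty()

    item = RoadItem(name='Route 66', location=db.GeoPt(35.0, -110.0))
    item.update_location()          # recompute the geocells after setting the location
    item.put()

    # Proximity fetch, as shown in the GeoModel tutorial/demo:
    nearby = RoadItem.proximity_fetch(
        RoadItem.all(),
        db.GeoPt(35.1, -110.1),     # center point
        max_results=10,
        max_distance=80000)         # distance limit (metres, if I read the docs right)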
About your impression that GAE is intended only for small web apps: there are a couple of CMSs that run on it.
Good luck!
If one of your concerns is scalability, and you don't want to depend on expensive or commercial tools, I would recommend that you take a look at this tech stack:
Erlang - A programming language designed for concurrency and distribution.
Nitrogen - An Erlang web framework with a lot of cool stuff, like transparent AJAX.
NoSQL scalable databases, such as CouchDB or Riak - These save you the hassle of SQL code and are more scalable than plain MySQL. Both have direct native Erlang APIs.
To be honest, I don't know if this tool set is your cup of tea; these are not mainstream solutions. I just suggest them to everyone who asks about size-sensitive web applications.
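One thing I like about CouchDB in particular is that everything goes over plain HTTP, so you can poke at it from any language while you learn Erlang. A quick sketch with Python's requests library, assuming an out-of-the-box local install on port 5984 with no auth (database and document names are made up):

    import requests

    base = "http://localhost:5984"

    requests.put(f"{base}/roads")                     # create a database
    requests.put(f"{base}/roads/item-1",              # store a document under an id
                 json={"name": "Route 66", "lat": 35.0, "lon": -110.0})

    doc = requests.get(f"{base}/roads/item-1").json() # read it back
    print(doc["name"])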
All serious web frameworks will provide you with what you need. The real issues (for example, scalability) might be tackled in a different way depending on what you use, but you won't be limited if you choose a well-known one. The choice of database system might be more important for that (SQL vs. NoSQL), even if both will do fine too.
It's all about
knowing how to use
enjoying using
the tool(s) you've chosen.
In either case, name-dropping some suggestions:
Rails (Ruby)
Django (Python)
Nitrogen (Erlang)
ASP.NET MVC (C#)
And please note: if you really want to learn everything from the bottom up, you'd be fine with any of these (or one of the other gazillion out there). But if you want to perform your best, choose one that supports a language you know well or uses techniques/tools you have experience with. Think twice about how you weigh "this is fun and we learn a lot" against "we want to be productive and do a really good job".
I am experimenting with App Engine. One of my stumbling blocks has been the support for managed relations, or lack thereof; this is further compounded by the lack of join support.
Without getting into the details of the issues I have run into (which I will post under a different topic), I would like to ask two things.
1. Have any of you used managed relations in something substantial? If so, could you share some best practices that would help?
2. Are there any good, comprehensive examples you have come across that you can point me to?
Thanks in advance.
I think this answer might disappoint you, but before you develop on the app engine you should read it anyway, and confirm this in the docs.
No. No one on the app engine has used managed relations for anything 'substantial', simply because Bigtable is not built for managed relations. It is a sharded and sorted array, and as such is a very different kind of data structure than what you would normally use.
Now there are attempts to set up managed relationships - the GAE/Java team is pushing JDO features that come close to this, and there's more info on this blog, but this simply isn't the natural state of things on the app engine, and you'll very quickly run into problems if you decide to spend too much time wrapping yourself in a leaky abstraction.
It's a lot easier to actually look at what Bigtable really is - there are a ton of videos on the Google I/O pages for 2010 and 2009 that do a fantastic job of explaining it - and then figure out ways to map your problem onto the capabilities of the datastore. It may sound unreasonable, but think about it... GAE is a tool that can do certain things exceedingly well, and if you can frame your problem in terms of ideas like object stores, sets, merge joins, task queues, pre-computation, and caching, then you can use this tool to kick ass.
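As a small illustration of that mindset (my own toy example, not something from the docs): instead of a join table you denormalize, store the related values directly on the entity as a list property, and let the datastore index do the work:

    from google.appengine.ext import db

    class Post(db.Model):
        title = db.StringProperty()
        tag_names = db.StringListProperty()   # denormalized stand-in for a join table

    post = Post(title='Hello', tag_names=['gae', 'bigtable'])
    post.put()

    # "Give me posts tagged 'gae'" -- no join; equality on a list property
    # matches any element, and the built-in index serves the query.
    tagged = Post.all().filter('tag_names =', 'gae').fetch(20)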
I have an application that requires analytics at different levels of aggregation - an OLAP workload. I also want to update my database pretty frequently.
e.g., here is what my update looks like (the schema is: time, dest, source IP, browser -> visits):
(15:00-1-2-2010, www.stackoverflow.com, 128.19.1.1, safari) --> 105
(15:00-1-2-2010, www.stackoverflow.com, 128.19.2.1, firefox) --> 110
...
(15:00-1-5-2010, www.cnn.com, 128.19.5.1, firefox) --> 110
And then I want to ask, for example, what the total number of visits to www.stackoverflow.com from a Firefox browser was last month.
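Just to make the workload concrete, here is roughly that query sketched against SQLite from Python (the table and column names are made up, and obviously the real data set wouldn't fit in SQLite):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE visits (
        ts TEXT, dest TEXT, source_ip TEXT, browser TEXT, count INTEGER)""")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?, ?, ?, ?)",
        [("2010-01-02 15:00", "www.stackoverflow.com", "128.19.1.1", "safari", 105),
         ("2010-01-02 15:00", "www.stackoverflow.com", "128.19.2.1", "firefox", 110),
         ("2010-01-05 15:00", "www.cnn.com", "128.19.5.1", "firefox", 110)])

    # total visits to stackoverflow.com from firefox in January 2010
    total = conn.execute("""
        SELECT SUM(count) FROM visits
        WHERE dest = 'www.stackoverflow.com'
          AND browser = 'firefox'
          AND ts BETWEEN '2010-01-01' AND '2010-01-31 23:59'
    """).fetchone()[0]
    print(total)   # 110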
I understand the Vertica system can do this in a relatively cheap way (performance- and scalability-wise, though probably not cost-wise). I have two questions here.
1) Is there an open-source product that I can build upon to solve this problem? In particular, how well does a Mondrian system work (scalability and performance)?
2) Is there an HBase- or Hypertable-based solution for this (obviously, naked HBase/Hypertable can't do this on its own)? If there is a project based on HBase/Hypertable, scalability probably won't be an issue, IMO.
Thanks!
You can download a free edition (the single-node edition) of the Greenplum database. I haven't tried it myself, but I think/guess it is a powerful beast. Read here: http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/
Another option is MongoDB; it is fast and free, and you can write MapReduce functions in JavaScript to do analytics.
My reputation here is too low to add a hyperlink to MongoDB, so you will have to google it. I can only add one hyperlink per post.
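Here is roughly what such a rollup looks like from Python with pymongo, assuming a local mongod and made-up collection/field names (I'm using the aggregation pipeline rather than hand-written JavaScript map/reduce, but it computes the same thing):

    from pymongo import MongoClient

    db = MongoClient("localhost", 27017).analytics

    result = list(db.visits.aggregate([
        {"$match": {"dest": "www.stackoverflow.com", "browser": "firefox"}},
        {"$group": {"_id": None, "visits": {"$sum": "$count"}}},
    ]))
    print(result)   # e.g. [{'_id': None, 'visits': 110}]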
The zohmg project aims to solve this problem using Hadoop and HBase.
Facebook also built Hive on top of Hadoop. It's pretty simple to get going, and it has a reasonable query API too.
http://mirror.facebook.net/facebook/hive/
Is your data model more complex than that? If it isn't, you might be better off just writing custom code for it. Then you can really tune it to your data. Real products have to offer a lot of flexibility, need a lot of complexity to achieve that, and suffer in speed as a result.
Your question is not clear in one aspect: when you talk about scalable, what do you mean by that? Are you collecting data from lots of sites but only have a limited number of query users, or do you also have a lot of users? Those situations lead to significantly different models.
With the rise of non-SQL database usage on high-traffic websites, I'm interested in using one for my project. Now, I've heard several names like Voldemort, MongoDB, and CouchDB. But which among these NoSQL databases is production ready? I've seen the download pages, and it seems that none of them is production ready because none has reached version 1.0 yet. Are there any other names, besides these three, that are recommendable for use in production?
What do you mean by production ready? As far as I know, all of them are being used on live systems.
You should make your choice based on how the features they provide fit your needs.
You can also add Tokyo Cabinet to the list, as well as the Mnesia database provided by the Erlang VM.
I think you need to start from your project requirements to see what kind of database you really need. There are many non-relational DBMSs out there, and they differ a lot in what kinds of problems they are good at solving. I think the article "Should you go Beyond Relational Databases?" by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:
The Next-gen Databases
Non-Relational Database Design
When shouldn't you use a relational database?
Good reasons NOT to use a relational database?
When you have narrowed down what you actually need, you can take a deeper look into the alternatives to see which DBMSs are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy a solution that, for example, lacks tool support, while in another project that could be a no-go.
As for version numbers, different projects have different takes on this, so you can't just compare version numbers. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a version 1.0 final yet.
I'm tempted to answer "use SIRA_PRISE".
It's definitely non-SQL.
And its current version is 1.2, meaning that someone like you must definitely assume it's "production-ready".
But perhaps I shouldn't be answering at all.
A nice article comparing RDBMSs with 'next gen' databases and listing some providers:
Is the Relational Database Doomed?
http://readwrite.com/2009/02/12/is-the-relational-database-doomed
I would suggest you use ArangoDB.
ArangoDB is a multi-model mostly-memory database with a flexible data model for documents and graphs. It is designed as a “general purpose database”, offering all the features you typically need for modern web applications.
ArangoDB is supposed to grow with the application—the project may start as a simple single-server prototype, nothing you couldn’t do with a relational database equally well. After some time, some geo-location features are needed and a shopping cart requires transactions. ArangoDB’s graph data model is useful for the recommendation system. The smartphone app needs a lean API to the back-end—this is where Foxx, ArangoDB’s integrated Javascript application framework, comes into play.
Another unique feature is ArangoDB’s query language AQL — it makes querying powerful and convenient. AQL enables you to describe complex filter conditions and joins in a readable format, much in the same way as SQL.
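To give a feel for AQL, here is a minimal sketch through the python-arango driver, assuming a local server with default credentials; the collection and field names are made up:

    from arango import ArangoClient

    client = ArangoClient(hosts="http://localhost:8529")
    db = client.db("_system", username="root", password="")

    cursor = db.aql.execute(
        """
        FOR u IN users
          FILTER u.city == @city
          FOR o IN orders
            FILTER o.user_key == u._key
            RETURN { user: u.name, total: o.total }
        """,
        bind_vars={"city": "Cologne"},
    )
    for row in cursor:
        print(row)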
You can model your data in several ways:
in key/value pairs
as collections of documents
as graphs with nodes, edges, and properties for both
You can access data in ArangoDB:
using the general HTTP REST API via curl/wget, or your browser
via the ArangoDB shell (“arangosh”)
using a programming language specific client library
Server requirements for ArangoDB:
ArangoDB runs on Linux, OS X and Microsoft Windows.
It runs on 32-bit and 64-bit systems, though using a 32-bit system will limit you to approximately 2 to 3 GB of data with ArangoDB.
At LinkedIn, when you visit someone's profile you can see how you are connected to them. I believe that LinkedIn shows up to 3rd-level connections, if not more, something like:
shabda -> Foo user, bar user, baz user -> Joel's connection -> Joel
How can I represent this in the database?
If I model it as:
User
Id PK
Name Char
Connection
User1 FK
User2 FK
Then, to find the network three levels deep, I need to get all my connections, their connections, and their connections, and then see if the current user is there. This obviously would be very inefficient with a DB of any size, and probably clunky to work with as well.
Since I can see this network on any profile I visit on LinkedIn, I don't think it is precalculated either.
The other thing that comes to mind is that this is probably best not stored in a relational DB - but then what would be the best way to store and retrieve it?
My recommendation would be to use a graph database. There seems to be only one implementation currently available, and that's Neo4j. It's written in Java, but has bindings to Ruby and Scala (Python in progress).
If you don't know Java, you probably won't be able to find anything similar on any other platform (yet), unfortunately. However, if you do know Java (or are at least willing to learn), it's way worth it. (Technically you don't even need to learn Java because of the Ruby/Python bindings.) Neo4j was built for exactly what you're trying to do. You'd go through a ton of trouble trying to implement this in a relational database, when you'd be able to do the exact same thing in only a few lines of Java code, and also much more efficiently.
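To show what "a few lines" means in practice, here is a hedged sketch of the third-level-connection query using the current official Neo4j Python driver and Cypher (which came later than the early bindings mentioned above); it assumes a local server and made-up label/relationship names:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        result = session.run(
            """
            MATCH path = shortestPath(
                (me:User {name: $me})-[:CONNECTED_TO*..3]-(them:User {name: $them}))
            RETURN [n IN nodes(path) | n.name] AS chain
            """,
            me="shabda", them="Joel")
        for record in result:
            print(record["chain"])   # e.g. ['shabda', 'Foo user', "Joel's connection", 'Joel']
    driver.close()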
If that's not an option, I'd still recommend looking at other database types such as object databases. Relational databases weren't built for this kind of thing, and you'd go through more pain by trying to do it in an RDBMS than by switching to a different kind of database and learning it.
I don't see why there's anything wrong with using a relational database for this. The tables defined in the question are an excellent start. With proper optimization you'll be able to keep your performance well in hand. I personally think you would need something serious to justify shifting away from such a versatile mainstream product. You'll probably need an RDBMS in the project anyway, and there is an unmatched number of legitimate choices in many price ranges (even free). You'll get quality documentation, support will be available, and you'll have a large supply of highly trained developers available in the job pool.
Regarding this model of self-relationships (users joined to other users), I recommend looking into recursive queries. That will keep you from performing a cascade of individual queries to find 3 levels of relationships. Consider the following SQL Server method for performing recursive queries with CTEs.
http://msdn.microsoft.com/en-us/library/ms186243.aspx
It allows you to specify how deep you want to go with the MAXRECURSION hint.
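The linked article is SQL Server specific, but the same idea works anywhere recursive CTEs are supported. Here is a toy sketch through SQLite from Python, using the tables from the question; a depth counter in the CTE plays the role MAXRECURSION plays in SQL Server:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE User (Id INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Connection (User1 INTEGER, User2 INTEGER);
        INSERT INTO User VALUES (1,'shabda'), (2,'Foo user'),
                                (3,'Joels connection'), (4,'Joel');
        INSERT INTO Connection VALUES (1,2), (2,3), (3,4);
    """)

    rows = conn.execute("""
        WITH RECURSIVE reachable(UserId, Depth) AS (
            SELECT User2, 1 FROM Connection WHERE User1 = ?
            UNION
            SELECT c.User2, r.Depth + 1
            FROM Connection c JOIN reachable r ON c.User1 = r.UserId
            WHERE r.Depth < 3                    -- stop at three levels
        )
        SELECT u.Name, r.Depth
        FROM reachable r JOIN User u ON u.Id = r.UserId
        ORDER BY r.Depth
    """, (1,)).fetchall()
    print(rows)   # [('Foo user', 1), ('Joels connection', 2), ('Joel', 3)]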
Next, you need to start thinking of ways to optimize. That starts with standard best practices for setting up your tables with proper indexes and maintenance, etc. It inevitably ends with denormalization. That's one of those things you only do once you've already tried everything else, but if you know what you're doing and use good practices, your performance gain will be significant. There are many resources on the internet to help you learn about denormalization; just look it up.