Graph Database Design Methodologies

I want to use a graph database for a web application (involving a web of Users, Posts, Comments, Votes, Answers, Documents and Document-Merges, plus some other transitive relationships on Users and Documents). So I started asking myself whether there is something like a design methodology for graph databases, i.e. an analogue of the design principles recommended for relational databases (such as the normal forms).
Example questions (of many questions arising):
Is it a good idea to create a top node "Users" that has a relationship ("exist") to every User node in the database?
Is it a good idea to build in version management, i.e. to create relationships (something like "follows") pointing to updated versions of a Document / Post, so that walking back along this relationship shows the changes the document went through?
etc...
So, do we need a Graph Database Design Cookbook?

The Gremlin User Group (http://tinkerpop.com/) and Neo4j User Group (https://groups.google.com/forum/?fromgroups#!forum/neo4j) are good places to discuss graph-database modeling.
You can create supernodes such as "Users", but it may be better and more performant to use indexes: create an index entry for each user with key=element_type, value="user", id=user_node_id.
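To make the index idea concrete, here is a minimal in-memory sketch. The TinyGraph class and its methods are hypothetical and are not the Neo4j or Bulbs API; the sketch only shows how a key/value index over nodes can answer "give me all users" without a "Users" supernode.

```python
from collections import defaultdict


class TinyGraph:
    """Toy graph with a key/value index over nodes (hypothetical, not a real graph DB API)."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.index = defaultdict(set)   # (key, value) -> set of node ids
        self._next_id = 0

    def add_node(self, element_type, **props):
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"element_type": element_type, **props}
        # Index entry: key=element_type, value=<type>, id=<node id>
        self.index[("element_type", element_type)].add(node_id)
        return node_id

    def lookup(self, key, value):
        # .get avoids creating empty index entries for unknown values
        return sorted(self.index.get((key, value), set()))


graph = TinyGraph()
alice = graph.add_node("user", name="alice")
post = graph.add_node("post", title="Hello")
bob = graph.add_node("user", name="bob")

# All users come from the index; no "Users" supernode with one edge per user is needed.
user_ids = graph.lookup("element_type", "user")
```

A supernode would accumulate one relationship per user, while the index keeps that bookkeeping out of the graph itself.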
A "follows" relation is often used for people/friends, as on Facebook and Twitter, so I wouldn't use that name for versioning. You can build a versioning system into Neo4j that timestamps each entry and uses a last-write-wins algorithm, and there are other database systems, like Datomic, that have this built in.
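As a rough illustration of timestamped versions with last-write-wins, here is a small sketch in plain Python. VersionedDocument and its methods are made-up names for illustration, and the integer timestamps are an assumption; a real system would also have to deal with clock skew and equal timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDocument:
    """Keeps every timestamped version; the newest timestamp is the current one."""
    versions: list = field(default_factory=list)  # list of (timestamp, content)

    def write(self, timestamp, content):
        self.versions.append((timestamp, content))
        # Keep versions ordered by timestamp so late-arriving writes land in place
        self.versions.sort(key=lambda version: version[0])

    def current(self):
        # Last-write-wins: the version with the highest timestamp is current
        return self.versions[-1][1] if self.versions else None

    def history(self):
        # Newest first, i.e. "walking back" through the changes
        return [content for _, content in reversed(self.versions)]


doc = VersionedDocument()
doc.write(100, "first draft")
doc.write(300, "third draft")
doc.write(200, "second draft")  # arrives late but carries an older timestamp
```

In a graph database the same idea would map each version to a node and link it to its predecessor with a dedicated relationship type (not "follows").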
See Lightbulb's model (https://github.com/espeed/lightbulb/blob/master/lightbulb/model.py) for an example blog model in Bulbs/Python (http://bulbflow.com).

Related

What database to choose for the hierarchical content with relations?

I want to build a review-style website that holds not only reviews but other types of content as well. The design of the website combines a hierarchical structure (each content object/record/entity has a parent, a kind of container) with relations (each content object/record/entity has a number of related objects):
an author of the content (i.e. user)
related comments (with their own relations, particularly authors)
item being reviewed as a separate record in DB
images from the gallery
One of the most important things is performance. Relations have been inefficient in NoSQL stores, as I've read on the net and found out myself on other projects. On the other hand, the general design, apart from the relations mentioned, has an obvious content-repository-like structure, which exactly reflects the hierarchical arrangement of objects (documents, articles, reviews) that websites are built around. Also, I really like the loose structure of records in NoSQL. Yet I don't care about (or use) things like versioning and other NoSQL-related features.
So I want to combine both worlds, hierarchical and relational, within one project, or rather within its model. Apart from that, I want the project to be RESTful, so that mobile apps can use the same content through the API. Another requirement is that the content should be searchable.
What type of storage would you choose for a project like this?
I decided to go with the Graph DBs. Here's why I rejected the other ones:
I don't want to use NoSQL (document stores), since relations are hard to maintain and often require extra (frequently custom) code infrastructure to handle them; see e.g. Diaspora's NoSQL problems
I don't want to use an RDBMS, since structure-based DBs impose well-known limitations and don't reflect the domain
I rejected the key-value and big table DBs as they have very specific use cases
Graph databases have been used in a number of content-oriented projects and appear to be doing the job surprisingly well.
You can easily model a hierarchical data structure in SQL with the following (using PostgreSQL):
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES comments (id),  -- NULL for a top-level comment
    content VARCHAR(1024)
);
Where parent refers to the id of the parent comment.
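To show how such a parent column can be traversed, here is a small sketch that walks a comment thread with a recursive common table expression. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (both support WITH RECURSIVE); the sample rows and the subtree function are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        parent INTEGER REFERENCES comments (id),
        content VARCHAR(1024)
    )
""")
conn.executemany(
    "INSERT INTO comments (id, parent, content) VALUES (?, ?, ?)",
    [
        (1, None, "root"),
        (2, 1, "reply to root"),
        (3, 2, "reply to reply"),
        (4, None, "another thread"),
    ],
)


def subtree(conn, root_id):
    """Return (id, depth) for the comment root_id and all of its descendants."""
    rows = conn.execute(
        """
        WITH RECURSIVE thread(id, parent, content, depth) AS (
            SELECT id, parent, content, 0 FROM comments WHERE id = ?
            UNION ALL
            SELECT c.id, c.parent, c.content, t.depth + 1
            FROM comments c JOIN thread t ON c.parent = t.id
        )
        SELECT id, depth FROM thread ORDER BY depth, id
        """,
        (root_id,),
    )
    return rows.fetchall()


thread = subtree(conn, 1)
```

The recursive part joins each comment to the rows already found, so one query returns the whole thread regardless of nesting depth.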
If you are after a NoSQL database that exposes a RESTful interface, you could consider CouchDB.
You can then replicate CouchDB to Elasticsearch for more robust searching.
But if your data is relational then I would very much recommend you consider a SQL database like PostgreSQL first.

How does Facebook (or how would you) partition an incredibly large database of friends?

How would you store 10 billion friends where a friend can be anyone's friend? The simplest solution is to create a database, a table called Person and create a many-to-many association between Person and Person.
But this would not scale properly. This data would need to be partitioned across many databases around the world so the load can be properly distributed.
As a software developer who's kind of new to database development, I'm curious how the SO community would solve this problem.
It's likely they don't use a relational database to store this information. It's more likely some flavor of NoSQL, possibly a Graph Database.
It's possible there are blogs (from Facebook or third parties) that discuss how it is actually done; the above is just my own speculation based on a basic understanding of these kinds of data stores.
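Purely as an illustration of what partitioning could look like, and in no way a description of what Facebook actually does, here is a sketch that spreads friend lists across shards by user id. ShardedFriendStore, the modulo placement rule, and the in-memory dicts standing in for separate databases are all assumptions.

```python
class ShardedFriendStore:
    """Toy partitioning: each user's friend list lives on the shard chosen by user id."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate database server
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, user_id):
        # Simple modulo placement; a real system would more likely use consistent hashing
        return self.shards[user_id % len(self.shards)]

    def add_friendship(self, a, b):
        # The edge is stored twice, once on each user's shard,
        # so reading one user's friends touches a single shard
        self._shard_for(a).setdefault(a, set()).add(b)
        self._shard_for(b).setdefault(b, set()).add(a)

    def friends_of(self, user_id):
        return sorted(self._shard_for(user_id).get(user_id, set()))


store = ShardedFriendStore(shard_count=4)
store.add_friendship(1, 6)
store.add_friendship(1, 7)
```

The trade-off in this sketch is that writes touch two shards so that reads of a single user's friend list stay local.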

Persisting User Preferences - Relational or Document-Oriented Database

I am looking into persisting user preferences past session expiration for an application and was curious whether, based on people's previous experiences, a Relational Database (e.g. Oracle, MySQL) or a Document-Oriented Database (e.g. MongoDB, Redis) is better suited for this task. To help clarify the meaning of user preferences, my web application would be storing pretty detailed information on a per-user basis, including but not limited to: window size and position, grid column width and order, and various widget states (collapsed/un-collapsed panels). All persistence in my application is currently handled by a Relational Database, but I have a feeling that something like user preferences may lend itself better to a Document-Oriented Database, because it may be hard to represent this data in a strictly structured way and a semi-structured approach may be better.
If you are already using a relational database for your application, it makes little sense to separate out just user preferences to a document-oriented DB; it would just increase complexity. When starting a new app, it's worth considering.
For an existing application you may consider using semi-structured data stores, like PostgreSQL's hstore.
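A minimal sketch of the semi-structured approach inside a relational database follows. It uses Python's sqlite3 with a JSON-encoded text column as a stand-in for an hstore or JSON column in PostgreSQL; the table name, column names and helper functions are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_preferences (user_id INTEGER PRIMARY KEY, prefs TEXT NOT NULL)"
)


def save_preferences(conn, user_id, prefs):
    """Store the whole preference document as one JSON-encoded value per user."""
    conn.execute(
        "INSERT OR REPLACE INTO user_preferences (user_id, prefs) VALUES (?, ?)",
        (user_id, json.dumps(prefs)),
    )


def load_preferences(conn, user_id):
    row = conn.execute(
        "SELECT prefs FROM user_preferences WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}


prefs = {
    "window": {"width": 1280, "height": 720, "x": 40, "y": 60},
    "grid_columns": ["name", "date", "size"],
    "panels": {"sidebar": "collapsed"},
}
save_preferences(conn, 7, prefs)
loaded = load_preferences(conn, 7)
```

The relational schema stays fixed while the shape of the preference document can change freely, which is the point of keeping this data semi-structured.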
The question being asked is about suitability, not the practicality of installing a new DB.
Which DB is better suited for non-relational data like user preferences?
Clearly the answer should be a non-relational DB. Document-oriented NoSQL databases are suitable for storing these.
The OP mentioned widget preferences and the like, which are most likely JSON documents/objects. This is another reason why MongoDB or another JSON document-oriented DB is more suitable.
There is also a fear of "installing a new database", which comes from painful experiences with older relational databases, pains that none of these NoSQL databases will have. But all this is beside the "suitability" question. Many factors besides the extra dependency go into the "practicality" decision.

Relational data model to Google datastore mapping

First off, I come from an RDBMS/SQL/C++/Java/Python background and I'm a newbie
to Gaelyk, the Google API and the Google datastore.
I like to model (using flowcharts for code and DB modeling tools for the database)
before I code.
I've used Erwin heavily in the past to do DB modeling.
In Erwin, I've designed a logical / physical data model of a database I'd like to
implement using the Google datastore and Gaelyk with the Google AppEngine SDK.
I wanted to design the data layout before coding anything.
My design tool of choice has been Erwin Data Modeler.
When I looked at the Google datastore, I saw that there
are no relational constraints, and joins are done via
WHERE clause :bind variables.
How can I map my existing model (with PKs/FKs, dependent entities, heavy relational links) to the Google datastore?
Is there a modeling tool that will allow me to design for the Google datastore?
Is the DB design supposed to flow from the Gaelyk MVC pattern and direct coding?
I'm not used to this as I come from an RDBMS background where you model heavily
and all good things come from good relational design.
Also, before coding a database client app in an imperative language (C++, C, Java, Python),
I like to write pseudocode, BUT first and foremost comes the DB design (if the app
has a DB back-end).
Am I doing this all wrong? It looks like there's a set of tools available to me
to start coding, but the design tool set is not there.
Addendum:
Here is the logical model I'm trying to map
How would I map a circular relationship
account --(1:m)-- following --(m:1)-- following_account_id --(1:1)-- account_id?
In general, the guiding principle of the App Engine datastore - and all nonrelational databases - is "optimize for reads". In short that means denormalize, denormalize, denormalize. In some cases, that will make updates harder - for example, if you make your username the primary key of your accounts table, and a user wants to change usernames - and in some cases that will require duplicating data, such as storing persistent counts. All of this is worthwhile, though, since it gives much better read performance and scalability, and in a typical webapp, reads outnumber writes by factors of hundreds to one.
Looking at your model in particular, it's very normalized - more so than most RDBMS models I've seen, even. Some suggestions:
Roll up things like 'user_name_id' into your main accounts table.
For things like 'following', use a list property if the number of people someone follows is typically small (<1000), or the fan-out pattern otherwise.
Pick a reasonable primary key for each table where practical, such as username or email, and use that as a key name. This allows looking up records with get operations instead of queries, which are substantially faster.
When a lookup table such as 'account type' is necessary, make sure the foreign key is descriptive enough that you only have to look up the corresponding record for administrative actions. Better, store small, infrequently changing details like this outside the datastore, so they can be accessed instantly.
For things like tags, use list properties to reduce the number of times you have to look up related entities, and to make indexing easier.
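The suggestions above can be sketched in plain Python. This is not the App Engine API: the datastore dict and the put, get and followers_of helpers are hypothetical stand-ins that only illustrate key-name lookups and list properties on a denormalized Account entity.

```python
datastore = {}  # (kind, key_name) -> entity properties; stands in for the real datastore


def put(kind, key_name, **props):
    datastore[(kind, key_name)] = props


def get(kind, key_name):
    # Direct lookup by key name, analogous to a get operation instead of a query
    return datastore.get((kind, key_name))


def followers_of(key_name):
    # Stands in for an equality filter on the 'following' list property
    return sorted(
        name
        for (kind, name), entity in datastore.items()
        if kind == "Account" and key_name in entity.get("following", [])
    )


# Denormalized accounts: username as key name, account type stored as a
# descriptive value, 'following' and 'tags' as list properties
put("Account", "alice", email="alice@example.com", account_type="premium",
    following=["bob", "carol"], tags=["python", "graphs"])
put("Account", "bob", email="bob@example.com", account_type="free",
    following=["alice"], tags=[])
put("Account", "carol", email="carol@example.com", account_type="free",
    following=[], tags=["sql"])

alice = get("Account", "alice")
```

Everything needed to render alice's profile comes back from one key lookup, which is the read-optimized shape the answer argues for.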
This only scratches the surface, of course, and there's a lot of collected wisdom here on SO, in the groups, and on blogs like mine. Feel free to come back and ask specific questions about data modelling!
To answer your other questions, no, there are no GAE-specific data modelling tools I'm aware of, but you can use a standard diagramming tool as you already are. Models are indeed defined in code, since the datastore is schemaless, but that doesn't have to be a barrier to the order in which you implement things.

Is there a nosql store that also allows for relationships between stored entities?

I am looking for NoSQL key-value stores that also provide for storing/maintaining relationships between stored entities. I know Google App Engine's datastore allows for owned and unowned relationships between entities. Do any of the popular NoSQL stores provide something similar?
Even though most of them are schemaless, are there methods to impose relationships on a key-value store?
Support for relationships between entities is one of the core features of graph databases. Typically, you model your entities as nodes and the relationships as relationships/edges in the graph. Unlike in an RDBMS, you don't have to define relationships in advance; just add them to the graph as needed (schema-free). I created a domain modeling gallery giving a few examples of how this can look in practice. The examples use the Neo4j graph DB, a project I'm involved in. The mailing list of this project has proved very helpful for graph modeling questions.
The key/value store Riak has support for links between stored objects.
You can add support for relationships on top of any database engine (like key/value), but it doesn't come without work. It all comes down to your use case. If you provide more details, it's easier to come up with a useful answer.
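As a rough idea of the work involved, here is a sketch that layers relationships on top of a plain key/value store by keeping adjacency lists under their own keys. The kv dict stands in for any key/value engine, and the key naming scheme and helper functions are assumptions.

```python
import json

kv = {}  # stand-in for any key/value store: string keys, string values


def put_entity(entity_id, data):
    kv[f"entity:{entity_id}"] = json.dumps(data)


def relate(source_id, rel_type, target_id):
    """Record a relationship by appending to an adjacency list stored under its own key."""
    key = f"rel:{source_id}:{rel_type}"
    targets = json.loads(kv.get(key, "[]"))
    if target_id not in targets:
        targets.append(target_id)
    kv[key] = json.dumps(targets)


def related(source_id, rel_type):
    """Follow the relationship: read the adjacency list, then fetch each target entity."""
    targets = json.loads(kv.get(f"rel:{source_id}:{rel_type}", "[]"))
    return [json.loads(kv[f"entity:{target}"]) for target in targets]


put_entity("u1", {"name": "alice"})
put_entity("p1", {"title": "Hello"})
put_entity("p2", {"title": "World"})
relate("u1", "wrote", "p1")
relate("u1", "wrote", "p2")
relate("u1", "wrote", "p1")  # duplicate, ignored

posts = related("u1", "wrote")
```

The store itself knows nothing about these links, so consistency (for example, cleaning up adjacency lists when an entity is deleted) is left to the application.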
Oops, I now see that the title says "nosql store" while your actual question narrows this down to "nosql key value store". Since key/value stores have no built-in semantics for defining relationships between entities, I'll still post my answer.
MongoDB is a document database, not a key/value store. It does provide, however, a simple form of inter-document references. These work more or less like SQL foreign keys, except that MongoDB does not enforce them: a reference is not updated or nulled when the referenced document is deleted, so the application has to handle dangling references itself.
This is adequate for the same sorts of things for which you'd use foreign keys, but it isn't optimized for serious graph traversal.
The relationships in Google App Engine are only keys to entities that are automatically de-referenced when accessed in code; they are plain values only when used in a filter. It's a function of the DB API rather than anything explicit, so accessing a ReferenceProperty will simply fetch the referenced entity from the referenced model to give you the object.
If you look at something like MongoDB, the relationships are stored in-object (from what I remember), but they can also be stored however you want, in the sense that you would create an API that searches the joined table for the item in the relationship, in a similar manner to how App Engine works.
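The de-referencing behaviour described here can be imitated with a Python descriptor. This is only a sketch of the idea and not the App Engine db API: the Reference class, the entities dict and the Author and Post classes are hypothetical.

```python
entities = {}  # key -> stored object; stands in for the datastore


class Reference:
    """Descriptor that stores only the referenced key and fetches the object on access."""

    def __set_name__(self, owner, name):
        self.attr = "_" + name + "_key"

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        key = getattr(obj, self.attr, None)
        # De-reference lazily: look the key up only when the attribute is read
        return entities.get(key) if key is not None else None

    def __set__(self, obj, value):
        # Only the key of the referenced object is kept on the instance
        setattr(obj, self.attr, value.key if value is not None else None)


class Author:
    def __init__(self, key, name):
        self.key = key
        self.name = name
        entities[key] = self


class Post:
    author = Reference()

    def __init__(self, key, title, author):
        self.key = key
        self.title = title
        self.author = author  # goes through Reference.__set__
        entities[key] = self


alice = Author("author:1", "alice")
post = Post("post:1", "Hello", alice)
author_name = post.author.name  # triggers the lookup by key
```

The post holds nothing but a key, so filtering on the relationship compares plain key values, while attribute access pays for a fetch.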
Paul.

Resources