What are the use cases of Graph-based Databases (http://neo4j.org/)? [closed] - database

Closed 10 years ago.
I have used Relational DB's a lot and decided to venture out on other types available.
This particular product looks good and promising: http://neo4j.org/
Has anyone used graph-based databases? What are the pros and cons from a usability perspective?
Have you used these in a production environment? What was the requirement that prompted you to use them?

I used a graph database in a previous job. We weren't using neo4j, it was an in-house thing built on top of Berkeley DB, but it was similar. It was used in production (it still is).
The reason we used a graph database was that the data being stored by the system and the operations the system was doing with the data were exactly the weak spot of relational databases and were exactly the strong spot of graph databases. The system needed to store collections of objects that lack a fixed schema and are linked together by relationships. To reason about the data, the system needed to do a lot of operations that would be a couple of traversals in a graph database, but that would be quite complex queries in SQL.
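To illustrate the traversal point, here is a minimal Python sketch. It is not the in-house system described above: the node names and the LINKS data are made up. The idea is only that "everything reachable from X" is a short breadth-first walk over an adjacency map, where SQL would typically need a recursive query or one self-join per hop.

```python
from collections import deque

# Hypothetical data: each node maps to the nodes it links to.
LINKS = {
    "alice": ["report-1", "bob"],
    "bob": ["report-2"],
    "report-1": ["dataset-a"],
    "report-2": ["dataset-a", "dataset-b"],
    "dataset-a": [],
    "dataset-b": [],
}


def reachable(start, links):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(reachable("alice", LINKS))
```

The depth of the relationship chain does not change the code; only the data grows.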
The main advantages of the graph model were rapid development time and flexibility. We could quickly add new functionality without impacting existing deployments. If a potential customer wanted to import some of their own data and graft it on top of our model, it could usually be done on site by the sales rep. Flexibility also helped when we were designing a new feature, saving us from trying to squeeze new data into a rigid data model.
Having a weird database let us build a lot of our other weird technologies, giving us lots of secret-sauce to distinguish our product from those of our competitors.
The main disadvantage was that we weren't using the standard relational database technology, which can be a problem when your customers are enterprisey. Our customers would ask why we couldn't just host our data on their giant Oracle clusters (our customers usually had large datacenters). One of the team actually rewrote the database layer to use Oracle (or PostgreSQL, or MySQL), but it was slightly slower than the original. At least one large enterprise even had an Oracle-only policy, but luckily Oracle bought Berkeley DB. We also had to write a lot of extra tools - we couldn't just use Crystal Reports for example.
The other disadvantage of our graph database was that we built it ourselves, which meant when we hit a problem (usually with scalability) we had to solve it ourselves. If we'd used a relational database, the vendor would have already solved the problem ten years ago.
If you're building a product for enterprisey customers and your data fits into the relational model, use a relational database if you can. If your application doesn't fit the relational model but it does fit the graph model, use a graph database. If it only fits something else, use that.
If your application doesn't need to fit into the current blub architecture, use a graph database, or CouchDB, or BigTable, or whatever fits your app and you think is cool. It might give you an advantage, and it's fun to try new things.
Whatever you choose, try not to build the database engine yourself unless you really like building database engines.

We've been working with the Neo team for over a year now and have been very happy. We model scholarly artifacts and their relationships, which is spot on for a graph db, and run recommendation algorithms over the network.
If you are already working in Java, I think that modeling using Neo4j is very straightforward, and it has the flattest / fastest R/W performance of any solution we tried.
To be honest, I have a hard time not thinking in terms of a Graph/Network because it's so much easier than designing convoluted table structures to hold object properties and relationships.
That being said, we do store some information in MySQL simply because it's easier for the Business side to run quick SQL queries against. To perform the same functions with Neo we would need to write code that we simply don't have the bandwidth for right now. As soon as we do though, I'm moving all that data to Neo!
Good luck.

Two points:
First, on the data I've been working with the past 5 years in SQL Server, I've recently hit the scalability wall with SQL for the type of queries we need to run (nested relationships...you know...graphs). I've been playing around with neo4j, and my lookup times are several orders of magnitude faster when I need this kind of lookup.
Second, to the point that graph databases are outdated. Um...no. Early on, as people were trying to figure out how to store and look up data efficiently, they created and played with graph and network style database models. These were designed so the physical model reflected the logical model, so their efficiency wasn't that great. This type of data structure was good for semi-structured data, but not as good for structured dense data. So, this IBM dude named Codd was researching efficient ways to arrange and store structured data and came up with the idea for the relational database model. And it was good, and people were happy.
What do we have here? Two tools for two different purposes. Graph database models are very good for representing semi-structured data and the relationships between entities (that may or may not exist). Relational databases are good for structured data that has a very static schema, and where join depths do not go very deep. One is good for one kind of data, the other is good for other kinds of data.
To coin the phrase, there is no Silver Bullet. It's very short-sighted to say that graph database models are out of date and that to use one gives up 40 years of progress. That's like saying using C is giving up all the technological progress we've gone through to get things like Java and C#. That's not true though. C is a tool that is needed for certain tasks. And Java is a tool for other tasks.

I've been using MySQL for years to manage engineering data, and it worked well, but one of the problems we had (but didn't realise we had) was that we always had to plan the schema up-front. Another problem we knew we had was mapping the data up to domain objects and back.
Now we've just started trying out neo4j and it looks like it is solving both problems for us. The ability to add different properties to each node (and relation) has allowed us to re-think our entire approach to data. It is like dynamic versus static languages (Ruby versus Java), but for databases. Building the data model in the database can be done in a much more agile and dynamic way, and that is dramatically simplifying our code.
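As a rough picture of what "different properties on each node (and relation)" means, here is a small Python sketch. It is not Neo4j's API: the Node and Relation classes and the engineering data are hypothetical and only mimic the property-graph idea.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a free-form property dictionary (no fixed schema)."""
    label: str
    props: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A directed, typed relation that can also carry properties."""
    kind: str
    source: Node
    target: Node
    props: dict = field(default_factory=dict)


# Hypothetical engineering data: two nodes with different property sets.
pump = Node("pump", {"flow_rate_lpm": 120, "vendor": "Acme"})
valve = Node("valve", {"diameter_mm": 50})
feeds = Relation("FEEDS", pump, valve, {"pipe_length_m": 3.5})

# A new property can be added later without any schema migration.
valve.props["installed"] = "2010-04-01"

print(sorted(pump.props), sorted(valve.props), feeds.props)
```

In a relational schema the same change would usually mean an ALTER TABLE or an extra table; here it is a single assignment.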
And since the object model in code is generally a graph structure, mapping from the database is also simpler, with less code and consequently fewer bugs.
And as an additional bonus, our initial prototype code for loading our data into neo4j is actually performing faster than the previous MySQL version. I have no solid numbers on this (yet), but that was a nice additional feature.
But at the end of the day, the choice probably should be based mostly on the nature of your domain model. Does it map better to tables or graphs? Decide by doing some prototypes, load the data and play with it. Use neoclipse to look at different views of the data. Once you've done that, hopefully you know if you're on to a good thing or not.

Here is a good article that talks about the needs that non relational databases fill: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
It does a good job at pointing out (aside from the name) that relational databases aren't flawed or wrong, it's just that these days people are starting to process more and more data in mainstream software and web sites, and that relational databases just won't scale for these needs.

I am building an intranet at my company.
I am interested in understanding how to take data that was stored in tables (Oracle, MySQL, SQL Server, Excel, Access, various random lists) and load it into Neo4J, or some other graph database. Specifically, what happens when common data overlaps data already in the system?
Yes, I know some data is best modeled in RDBMS, but I have this idea itching me, that when you need to superimpose several distinct tables, the graph model is better than the table structure.
For instance, I work in a manufacturing environment. There is a major project we are working on and because of the complexity, each department has created a separate Excel spreadsheet that has a BOM (Bill Of Materials) hierarchy in a column on the left and then several columns of notes and checks made by individuals who made these sheets.
So one of the problems is merging all these notes together into one "view" so that someone can see all the issues that need to be addressed in any particular part.
The second problem is that an Excel spreadsheet sucks at representing a hierarchical BOM when a common component is used in more than one subassembly. Meaning that, if someone writes a note about the P34 relay in the ignition subassembly, the same comment should be associated with the P34 relays used in the motor driver subassembly. This won't occur in the Excel spreadsheet.
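The shared-component problem can be sketched in a few lines of Python. The BOM contents, the part numbers other than P34, and the note text are invented for illustration. The point is that a note attached to the part node, rather than to a spreadsheet row, shows up in every subassembly that uses the part.

```python
# Hypothetical BOM: each assembly lists the part numbers it uses.
BOM = {
    "ignition": ["P34", "C12"],
    "motor-driver": ["P34", "R7"],
}

# Notes are stored once per part number, not once per spreadsheet row.
NOTES = {}


def add_note(part, text):
    """Attach a note to the part itself."""
    NOTES.setdefault(part, []).append(text)


def notes_for_assembly(assembly):
    """Collect the notes for every part used in the given assembly."""
    return {part: NOTES.get(part, []) for part in BOM[assembly]}


add_note("P34", "Contacts weld under inrush current")

print(notes_for_assembly("ignition"))
print(notes_for_assembly("motor-driver"))
```

Both calls report the same P34 note, because both assemblies point at the same part.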
For the company intranet, I want to be able to search for anything easily. Such as data related to a part number, a BOM structure, a phone number, an email address, a company policy, or procedure. I want to even extend this to manage computer hardware assets, and installed software.
I envision that once the information network starts to get populated you can start doing cool traversals such as "I want to write an email to everyone working on the XYZ project". People will have been associated with the project because they will be tagged as creating and modifying the data within the XYZ project. So by using the XYZ project as a search key, a huge set with everything related to the XYZ project will be created. Including links to people who built the XYZ project. The people links will connect to their email addresses. So by their involvement in the XYZ project, they will be included in my email. This is in stark contrast to some secretary trying to maintain a list of people working on the project. We generate a lot of lists. We spend a lot of time maintaining lists and making sure they are up to date. And most of it doesn't add any value to our products.
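The "email everyone on the XYZ project" traversal might look roughly like this in Python. All names, items, and addresses below are placeholders. The mailing list is derived from who touched the project's data, so nobody has to maintain it by hand.

```python
# Hypothetical edges: (person, action, item), plus item -> project membership.
EDITS = [
    ("ann", "created", "bom-17"),
    ("raj", "modified", "bom-17"),
    ("ann", "modified", "spec-3"),
    ("li", "created", "policy-9"),
]
ITEM_PROJECT = {"bom-17": "XYZ", "spec-3": "XYZ", "policy-9": "ABC"}
EMAIL = {
    "ann": "ann@example.com",
    "raj": "raj@example.com",
    "li": "li@example.com",
}


def project_emails(project):
    """Follow project -> items -> people -> email addresses."""
    items = {item for item, proj in ITEM_PROJECT.items() if proj == project}
    people = {person for person, _action, item in EDITS if item in items}
    return sorted(EMAIL[p] for p in people)


print(project_emails("XYZ"))
```

A graph database would run the same three hops as a traversal query instead of Python set comprehensions.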
Another cool traversal could report all the computers that have a certain piece of software installed, by version. That report could be used to generate tasks to remove extra copies of old software and to update people who need to have the latest copy. It would also be useful for license tracking.

Might be a bit late, but there is a growing number of projects using Neo4j, the better known ones listed at Neo4j. Also NeoTechnology, the company behind Neo4j, has some references at their customers page.
Note: I am part of the Neo4j team

Related

When is a flat DB design acceptable

When is it ok to use a flat DB table design nowadays? Ever? What I mean is: when is it ok to abandon the wisdom of relational database design and revert to a flat table structure that incorporates no links, adding extra columns to add more data, when we should be creating a key to another table to store multiple rows?
I'm working on some ideas to discuss with a product management team. When I initially asked the question "Why are all these tables flat in nature" I was told that
"Read centric databases display better performance with a flat table structure."
I struggle with this explanation b/c a flat design presents so many barriers to progress down the road.
Thoughts?
"Read centric databases display better performance with a flat table structure." This statement implies the table will rarely, if ever, be used for insert/update/delete operations. In that case the table must be properly indexed to get good performance. Since there won't be any joins, queries will rely heavily on filters in the WHERE clause, so appropriate indexing is really important.
This kind of design is usually found in data warehouses. When we design warehouses, we usually eliminate primary/foreign keys and use business primary keys. This is because of the huge size of the database in a warehouse.
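The indexing point can be tried out with Python's built-in sqlite3 module. SQLite is only a stand-in here, and the table, column, and index names are made up. The sketch builds a flat table, indexes the column used in the WHERE clause, and asks the engine for its query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A deliberately flat, denormalised table: no foreign keys, no joins.
conn.execute(
    "CREATE TABLE sales_flat (customer TEXT, region TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_flat VALUES (?, ?, ?, ?)",
    [
        ("acme", "east", "widget", 10.0),
        ("acme", "west", "gear", 5.0),
        ("zenith", "east", "widget", 7.5),
    ],
)
# Index the column that the WHERE clause filters on.
conn.execute("CREATE INDEX idx_sales_flat_customer ON sales_flat (customer)")

query = "SELECT product, amount FROM sales_flat WHERE customer = ?"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("acme",)).fetchall()
rows = conn.execute(query, ("acme",)).fetchall()

print(plan)  # the last column of each plan row describes the access path
print(rows)
```

The exact wording of the plan differs between SQLite versions, so check it for the index name rather than comparing it against a fixed string.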
Never.
Whatever problem you think you are going to solve by ignoring relational database theory, you will only create many more intractable problems. Furthermore, the original problem that you attempt to avoid by ignoring relational theory will invariably be based on a misconception anyway.
Short answer: Almost always!
Your website almost never needs conventional database!
After 20 years of working as an IT admin with big and small projects I can say with confidence that over 90% of today's websites do not need DataBase AT ALL.
It's just another layer of obfuscation that most companies and people can do without.
Face the facts people. Most websites out there don't get a single hit in a day so talking about DataBase performance is quite silly when it comes to HUGE majority of websites today (2019).
That means that over 90% of these sites could and should switch to some flat file CMS/CMR like PageKit, Grav or Bludit (It's my personal favorite because of its minimalistic approach. It disdains flatDB and uses ordinary folders to contain articles in HTML files.)
I never did figure out why CMS leaders like WordPress and Joomla insist on complicating their default setup by forcing their users to use a DataBase connection and configuration that's often the reason the site malfunctions. Only when a site actually needs some type of DB, for instance because it has many user accounts, is a DB warranted. Still, most websites have only a handful of user accounts.
Many times we see some site down because the DataBase engine is down or can't handle so many simultaneous connections while Apache or NginX web-servers are still up and running.
Don't just follow others blindly. Time to be brave and lead.

why wordpress does not use views or stored procedures

I installed a wordpress blog and was tinkering with the database,
I noticed they are not using any stored procedures or views. Why is this?
Or is it just not available for wordpress.org users and some premium feature for paid wordpress.com members?
Is it not advisable to use these to improve performance considering wordpress stores almost everything except media files in database.
Are there any resources / attempts to optimize wp database using these ?
The decision regarding where to keep transformations of / operations on data is heavily rooted in the concept of what you consider to be the central interface to the data within the application as a whole.
If you're a database programmer, you're much more likely to consider that central point to be the database. In this view, the data is the center, and the surrounding application can be thought of as just an interface on top of that data. This view makes sense when dealing with anything where data itself is key. I.e., where the data will stay put over time, and the ways in which the data is accessed, or the things which you want to do with the data will change over time. Examples which fit well into this view include: Financial systems, Healthcare records, Customer data, Phone records... pretty much anything that has a lot of ways of looking at the data, and is constantly growing.
If you're an application programmer, the data itself may be almost secondary. In this view, the data is transient. Where and how that data is stored is even less important. The MVC pattern encourages the database to be utterly replaceable, and strongly discourages putting any sort of logic related to anything other than basic data integrity into the database. There is certainly nothing about the MVC pattern or other application-centric development practices which argues specifically against stored procedures or views, but there is much less room for them to be useful. Examples which fit well into this view include: Blogs, Message-boards, Stand-alone Documents... pretty much anything that has a very simple structure, does not have complex relations, and can be divided easily into self-contained units. Anything for which "what you can do" is tied closely in concept to "what you are doing it to".
A summary of the two above-mentioned viewpoints is that there are tools for which examining data is more important (data-centric), and there are tools for which creating data is more important (application-centric).
Another way of looking at it is that Stored Procedures and Views are just interfaces on top of a database. Wordpress is also an interface on top of a database, it's just written in PHP.
Well, I don't know their rationale for a fact but my guess would be that since MySQL actually stores the procedures in the "mysql" database - not the wordpress database where the tables are - that they did it because it can be an access issue. Let's say you have a DB server supporting multiple WP databases. All the procedures get put into the "mysql" database. So when you backup your WP database you don't get any of the procedures. You'd need to back up the mysql (system) database, and it's likely the users would not have the rights to do so in such an environment, which is the typical environment for WP installs.
Excellent answers. To add, I think that from a plugin coding side, it is easier to update just the file system and do as little database work on an as needed basis.
Especially if a plugin update doesn't install right the first time and you have to restore the files and try again, a database change would be a lot more difficult to reverse.

How to present a database design?

I am doing a project in the university and it includes a MySQL database. I have a design for the database in terms of a list of tables and their respective fields.
In what form should I present this design? Just the list of tables and content? In an ERD? How do you present your designs?
To clarify - whatever you answer, I expect not only a specification of how you present your design, but also which tools you use to create the diagrams/lists/tables etc.
ERD is the only way to go. As they say, a picture is worth a thousand words.
But don't try to put the whole database on one diagram. It will, in all but the most trivial cases, be overwhelming to your audience to try to digest the entire database design in one go. Instead, break the diagrams into subject areas depicting only the most relevant tables in each diagram. For example, a point-of-sale system might have separate diagrams for Inventory, Sales, Accounting, Customer Management, Security, Auditing, and Reporting. Some tables will show up in more than one subject area -- this is to be expected.
As far as tooling, nothing beats ErWin, but it is really expensive and only available for Windows. Visio is ubiquitous in a corporate environment, but is only available on Windows and is not exactly cheap either. Macs offer some really nice diagramming tools; most of them are not free.
Dia is a decent, free, and cross-platform diagramming tool. It is a bit quirky, though; and I have not had much success making the diagrams look as nice as I want them to look.
For MySQL, I have played with fabFORCE dbDesigner and it is not bad, but I did find its support for multiple subject areas to be a bit lacking at the time -- perhaps they've improved it since. But it is free and works on Windows and Linux.
For the actual presentation, I create images from these diagramming tools and pull them into presentation software (PowerPoint, KeyNote, or OpenOffice Impress). These presentations can be exported to PDF and distributed to the audience; they won't need anything more than a PDF viewer to review the information later.
Let's look at this from your professor's perspective. If I were him/her:
I would require an ERD. Without it, I cannot see one of the most fundamental issues of a database design, how are the tables related.
I would also expect some basic use cases/ requirements. What problems are you trying to solve with this database design?
I would want to see what indexes are in place, especially on the foreign key columns. I would want to see expected row counts in all tables to determine if indexes are even required.
I would want to see column data types to determine if they meet the requirements. I would want to see what columns accept NULL values, since that often can cause problems if you're not careful.
If I were using SQL Server, I would probably create a diagram in SSMS to display a somewhat basic ERD. Visio can be used as well. I might use Visio to create my use cases, or perhaps Microsoft Word.
MySQL Workbench will make you pretty graphics for presentation, among many other sophisticated features.
Depends on the audience. ERD certainly isn't the only answer and may not be the best. You should choose a medium that your audience will understand.
Don't forget to discuss design aspects that can't fit into an ERD:
1) how inheritance/aggregation relationships from your analytical model are implemented in your db.
2) how you are going to support hierarchies of your objects in the rdb (if you have any).
3) relationships that are in your analytical model but are not supported by the rdb design.
4) the ETL process, change tracking, schema change tracking, and resource-based security.
5) storage partitioning and maintenance aspects (one of the goals being to optimize backup time).
6) in-production testing (test island data) and easy cloning of the db for a test environment.

NoSql/Raven DB implementation best practices

I'm investigating a new project which will be a social networking style site. I'm reading up on RavenDb and I like the look of a lot of its features. I've not read up on nosql all that much but I'm wondering if there's a niche it fits best with and old school sql is still the best choice for other stuff.
I'm thinking that the permissions plug in would be ideal for a social net style site - but will it really perform in an environment where the database will be getting hammered - or is it optimised for a more reporting style system where it's possible to keep throwing new data structures at the database and report on those structures.
I'm eager to use the right tool for the job - I'll be using MVC3, Windsor + either Nhibernate+Sql server or RavenDb.
Should I stick with the old school sql or go with the new kid on the block: ravendb?
This question can get very close to being subjective (even though it's really not), you're talking about NoSQL as if it is just one thing, and that is not the case.
You have
graph databases (Neo4j etc),
map/reduce style document databases (Couch,Raven),
document databases which attempt to feel like ordinary databases (Mongo),
Key/value stores (Cassandra etc)
moar goes here.
Each of them attempts to solve a different problem via different means, and whether you'd use one of them over a traditional relational store is
A matter of suitability
A matter of personal preference
At the end of the day, for the primary data-storage for a single system, a document database or relational store is probably what you want, although for different parts of your system you may well end up utilising a graph database (For calculating neighbours etc), or a key/value store (like Facebook does/did for inbox messages).
The main benefit of choosing a document store as your primary store over that of a relational one, is that you haven't got to worry about trying to map your objects into a collection of tables, and there is less configuration overhead involved in doing so.
The other downside/upside would be that you have to learn something new and make mistakes along the way.
So my answer if I am going to be direct?
RavenDB would be suitable
SQL would be suitable
Which do you prefer to use? These days I'd probably just go for Raven, knowing I can dump data into a relational store for reporting purposes and probably do likewise for other parts of my system, and getting free-text search and fastish-writes/fast-reads without going through the effort of defining separate read/write stores is an overall win.
But that's me, and I am biased.

How would you design your database to allow user-defined schema

If you have to create an application like - let's say a blog application, creating the database schema is relatively simple. You have to create some tables, tblPosts, tblAttachments, tblComments, tblBlaBla… and that's it (ok, I know, that's a bit simplified but you understand what I mean).
What if you have an application where you want to allow users to define parts of the schema at runtime. Let's say you want to build an application where users can log any kind of data. One user wants to log his working hours (startTime, endTime, project Id, description), the next wants to collect cooking recipes, others maybe stock quotes, the weekly weight of their babies, monthly expenses they spent for food, the results of their favorite football teams or whatever stuff you can think about.
How would you design a database to hold all that very very different kind of data? Would you create a generic schema that can hold all kind of data, would you create new tables reflecting the user data schema or do you have another great idea to do that?
If it's important: I have to use SQL Server / Entity Framework
Let's try again.
If you want them to be able to create their own schema, then why not build the schema using, oh, I dunno, the CREATE TABLE statement. You have a full boat, fully functional, powerful database that can do amazing things like define schemas and store data. Why not use it?
If you were just going to do some ad-hoc properties, then sure.
But if it's "carte blanche, they can do whatever they want", then let them.
Do they have to know SQL? Umm, no. That's your UIs task. Your job as a tool and application designer is to hide the implementation from the user. So present lists of fields, lines and arrows if you want relationships, etc. Whatever.
Folks have been making "end user", "simple" database tools for years.
"What if they want to add a column?" Then add a column, databases do that, most good ones at least. If not, create the new table, copy the old data, drop the old one.
"What if they want to delete a column?" See above. If yours can't remove columns, then remove it from the logical view of the user so it looks like it's deleted.
"What if they have eleventy zillion rows of data?" Then they have eleventy zillion rows of data and operations take eleventy zillion times longer than if they had 1 row of data. If they have eleventy zillion rows of data, they probably shouldn't be using your system for this anyway.
The fascination of "Implementing databases on databases" eludes me.
"I have Oracle here, how can I offer fewer features and make it slower for the user??"
Gee, I wonder.
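A minimal sketch of "let the UI drive CREATE TABLE", using Python's built-in sqlite3 as a stand-in for SQL Server. The TYPE_MAP, the helper names, and the work_log example are assumptions made for illustration. Identifiers can't be bound as query parameters, so the sketch validates them against a strict pattern before quoting them.

```python
import re
import sqlite3

# Map the field types offered in a hypothetical UI to SQL column types.
TYPE_MAP = {"text": "TEXT", "number": "REAL", "integer": "INTEGER", "date": "TEXT"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_ident(name):
    """Accept only plain identifiers, then double-quote them for SQL."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'


def create_user_table(conn, table, fields):
    """Create a real table from a list of (field_name, ui_type) pairs."""
    columns = ", ".join(
        f"{quote_ident(name)} {TYPE_MAP[ui_type]}" for name, ui_type in fields
    )
    conn.execute(
        f"CREATE TABLE {quote_ident(table)} (id INTEGER PRIMARY KEY, {columns})"
    )


def add_user_column(conn, table, name, ui_type):
    """Add a column later, when the user extends their schema."""
    conn.execute(
        f"ALTER TABLE {quote_ident(table)} "
        f"ADD COLUMN {quote_ident(name)} {TYPE_MAP[ui_type]}"
    )


conn = sqlite3.connect(":memory:")
create_user_table(
    conn,
    "work_log",
    [("start_time", "date"), ("end_time", "date"), ("description", "text")],
)
add_user_column(conn, "work_log", "project_id", "integer")
print([row[1] for row in conn.execute('PRAGMA table_info("work_log")')])
```

The user only ever sees field names and types; the SQL stays behind the UI, as the answer suggests.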
There's no way you can predict how complex their data requirements will be. Entity-Attribute-Value is one typical solution many programmers use, but it might not be sufficient, for instance if the user's data would conventionally be modeled with multiple tables.
I'd serialize the user's custom data as XML or YAML or JSON or similar semi-structured format, and save it in a text BLOB.
You can even create inverted indexes so you can look up specific values among the attributes in your BLOB. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql (the technique works in any RDBMS, not just MySQL).
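A rough sketch of the serialize-plus-inverted-index idea, using Python's sqlite3 and json modules. The table layout and the helper names are assumptions made here, not taken from the linked article. Each record is stored whole as JSON text, and a narrow lookup table maps chosen attribute values back to entity ids.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# The entity table stores the whole user-defined record as JSON text.
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
# One narrow lookup table maps (attribute, value) pairs back to entity ids.
conn.execute("CREATE TABLE attr_index (attr TEXT, value TEXT, entity_id INTEGER)")
conn.execute("CREATE INDEX idx_attr_value ON attr_index (attr, value)")


def save(record, indexed=()):
    """Store the record as JSON and index the chosen attributes."""
    cur = conn.execute(
        "INSERT INTO entities (body) VALUES (?)", (json.dumps(record),)
    )
    for attr in indexed:
        if attr in record:
            conn.execute(
                "INSERT INTO attr_index VALUES (?, ?, ?)",
                (attr, str(record[attr]), cur.lastrowid),
            )
    return cur.lastrowid


def find(attr, value):
    """Look up records by an indexed attribute, then decode the JSON."""
    rows = conn.execute(
        "SELECT e.body FROM attr_index i JOIN entities e ON e.id = i.entity_id "
        "WHERE i.attr = ? AND i.value = ? ORDER BY e.id",
        (attr, str(value)),
    )
    return [json.loads(body) for (body,) in rows]


save({"kind": "recipe", "title": "Pancakes", "servings": 4}, indexed=("kind",))
save({"kind": "worklog", "project_id": 7, "hours": 2.5}, indexed=("kind", "project_id"))
print(find("project_id", 7))
```

Only indexed attributes are searchable this way; everything else stays opaque to the database, which is the tradeoff mentioned for schema-less storage.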
Also consider using a document store such as Solr or MongoDB. These technologies do not need to conform to relational database conventions. You can add new attributes to any document at runtime, without needing to redefine the schema. But it's a tradeoff -- having no schema means your app can't depend on documents/rows being similar throughout the collection.
I'm a critic of the Entity-Attribute-Value anti-pattern.
I've written about EAV problems in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Here's an SO answer where I list some problems with Entity-Attribute-Value: "Product table, many kinds of products, each product has many parameters."
Here's a blog I posted the other day with some more discussion of EAV problems: "EAV FAIL."
And be sure to read this blog "Bad CaRMa" about how attempting to make a fully flexible database nearly destroyed a company.
I would go for a Hybrid Entity-Attribute-Value model, so like Antony's reply, you have EAV tables, but you also have default columns (and class properties) which will always exist.
Here's a great article on what you're in for :)
As an additional comment, I knocked up a prototype for this approach using Linq2Sql in a few days, and it was a workable solution. Given that you've mentioned Entity Framework, I'd take a look at version 4 and their POCO support, since this would be a good way to inject a hybrid EAV model without polluting your EF schema.
On the surface, a schema-less or document-oriented database such as CouchDB or SimpleDB for the custom user data sounds ideal. But I guess that doesn't help much if you can't use anything but SQL and EF.
I'm not familiar with the Entity Framework, but I would lean towards the Entity-Attribute-Value (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model) database model.
So, rather than creating tables and columns on the fly, your app would create attributes (or collections of attributes) and then your end users would complete the values.
But, as I said, I don't know what the Entity Framework is supposed to do for you, and it may not let you take this approach.
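For reference, a bare-bones Entity-Attribute-Value layout could look like this in Python with sqlite3. The table and column names are made up for the sketch. Note that every value ends up stored as text, which is one of the weaknesses critics of EAV point out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (id INTEGER PRIMARY KEY, owner TEXT NOT NULL);
    CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE entity_value (
        entity_id INTEGER REFERENCES entity(id),
        attribute_id INTEGER REFERENCES attribute(id),
        val TEXT,
        PRIMARY KEY (entity_id, attribute_id)
    );
    """
)


def set_value(entity_id, name, val):
    """Create the attribute on first use, then store (or overwrite) its value."""
    conn.execute("INSERT OR IGNORE INTO attribute (name) VALUES (?)", (name,))
    (attr_id,) = conn.execute(
        "SELECT id FROM attribute WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO entity_value VALUES (?, ?, ?)",
        (entity_id, attr_id, str(val)),
    )


def load(entity_id):
    """Return all attributes of one entity as a name -> text value dict."""
    rows = conn.execute(
        "SELECT a.name, v.val FROM entity_value v "
        "JOIN attribute a ON a.id = v.attribute_id WHERE v.entity_id = ?",
        (entity_id,),
    )
    return dict(rows.fetchall())


eid = conn.execute("INSERT INTO entity (owner) VALUES (?)", ("user-1",)).lastrowid
set_value(eid, "weight_kg", 4.2)
set_value(eid, "measured_on", "2010-03-01")
print(load(eid))
```

Reading one logical record already needs a join, and type checks or constraints on individual attributes have to live in application code.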
Not as a critical comment, but it may help save some of your time to point out that this is one of those Don Quixote Holy Grail type issues. There's an eternal quest for probably over 50 years to make a user-friendly database design interface.
The only quasi-successful ones that have gained any significant traction that I can think of are 1. Excel (and its predecessors), 2. Filemaker (the original, not its current flavor), and 3. (possibly, but doubtfully) Access. Note that the first two are limited to basically one table.
I'd be surprised if our collective conventional wisdom is going to help you break the barrier. But it would be wonderful.
Rather than re-implement SQL Server's "CREATE TABLE" statement, which was done many years ago by a team of programmers who were probably better than you or I, why not work on exposing SQL Server in a limited way to the users -- let them create their own schema in a limited way and leverage the power of SQL Server to do it properly.
I would just give them a copy of SQL Server Management Studio, and say, "go nuts!" Why reinvent a wheel within a wheel?
Check out this post: you can do it, but it's a lot of hard work :) If performance is not a concern, an XML solution could work too, though that is also a lot of work.

Resources