NoSQL / RDBMS hybrid with referential integrity (delete cascade)? - database

Is there a database out there that gives you the benefit of referential integrity and being able to use a SQL type language for querying, but also lets entities be loosely defined with respect to their data attributes and also the relationships between them?
E.g. take a RBAC type model where you have Permissions, Users, User Groups & Roles. A complex/flexible model could have the following rules:
Roles can have one or more permissions and a permission can belong to one or more Roles
Users can have one or more permissions and a permission can belong to one or more Users
Users Groups can have one or more permissions and a permission can belong to one or more Users Groups
Users can have one or more roles and a role can belong to one or more Users
User Groups can have one or more roles and a role can belong to one or more User Groups
Roles can have one or more roles and a role can belong to one or more Roles
To model the above in an RDBMS would involve the creation of lots of intersection tables. Ideally, all I'd like to define in the database is the entities themselves (User, Role, etc) plus some mandatory attributes. Everything else would then be dynamic (i.e. no DDL required), e.g. I could create a User with a new attribute which wasn't pre-defined. I could also create a relationship between entities that hasn't been predefined, though the database would handle referential integrity like a normal RDBMS.
The above can be achieved to some degree in a RDBMS by creating a table to store entities and another one to store relationships etc, but this overly complicates the SQL needed to perform simple queries and may also have performance implications.

Most NoSQL databases are built to scale very well. This is done at the cost of consistency, of which referential integrity is part of. So most NoSQL don't support any type of relational constraints.
There's one type of NoSQL database that does support relations. In fact, it's designed especially for relations: the graph database. Graph databases store nodes and explicit relations (edges) between these nodes. Both nodes and edges can contain data in the form of key/value pairs, without being tied to a predefined schema.
Graph databases are optimized for relational queries and nifty graph operations, such as finding the shortest path between two nodes, or finding all nodes within a given distance from the current node. You wouldn't need this in a role/permission scenario, but if you do, it'll be a lot harder to achieve using an RDBMS.
Another option is to make your entire data layer a hybrid, by using a RDBMS to store the relations and a document database to store the actual data. This would complicate your application slightly, but I don't think it's such a bad solution. You'll be using two different technologies, both dealing with the problems they were designed to deal with.

Given the requirements you specify in your question, a graph database is probably the sort of thing you are looking for, but there are other options. As #Niels van der Rest said, the two constraints of "no a priori schema" and "referential integrity" are very hard to reconcile. You might be able to find a Topic-Map based database that might do so, but I'm not familiar with specific implementations so I couldn't say for sure.
If you decide you really can't do without referential integrity, I fear you probably are stuck with an RDBMS. There are some tricks you can use that might avoid some of the problems you anticipate, I cover a couple in https://stackoverflow.com/questions/3395606..., which might give you some ideas. Still, for this sort of data-model requiring dynamic, post-priori schema, with meta-schema elements, an RDBMS is always going to be awkward.
If you are willing to forgo referential integrity, then you still have three approaches to consider.
Map/Reduce - in two flavours: distributed record-oriented (think, MongoDB), and column-oriented (think, Cassandra). Scales really really well, but you won't have your SQL-like syntax; joins suck; and matching your architecture to your specific query types is critical. In your case your focus on the entities and their attributes, rather than the relationships between the entities themselves, so I would probably consider a distributed record-oriented store; but only if I expected to need to scale beyond a single node—they do scale really really well.
Document-store - technically in two flavours, but one of them is a distributed record-oriented map/reduce datastore discussed above. The other is an inverted-index (think, Lucene/Solr). Do NOT disregard the power of an inverted-index; they can resolve obscenely complex record predicates amazingly fast. What they can't do is handle well is queries that include correlation or large relational joins. Still, you will be amazed at the incredible flexibility, sufficiently complex record predicates gives you.
Graph-store - come in a few flavours the first is the large-scale, ad-hoc key-value store (think, DBM/TokyoTyrant); the second is the tuple-space (think, Neo4j); the third is the RDF database (think, Sesame/Mulgara). I have a soft-spot for RDF, having helped develop mulgara, so I am not the most objective commenter. Still, if your scalability constraints will permit you to use an RDF-store, I find the inferencing permitted by RDF's denotational semantics (rare amongst noSQL datastore options) invaluable.

Some NoSQL solutions support security and SQL. One of these is OrientDB. The security system is (quite) well explained here.
Furthermore supports SQL.

There's the Gremlin language, supported by the Neo4j graph database. Regarding your example, have a look at Access control lists the graph database way and here. There's also a web-based tool including a REST API to Neo4j and a Gremlin console, see neo4j/webadmin.

You may want to check out MongoDB it is a document based database and so has a flexible schema. It is awesome and worth the time to see if it would suite your needs.

Related

Do graph databases deprecate relational databases?

I'm new to DBs of any kind. It seems you can represent any relational database in graph form (although it might be a very flat graph), and any graph database in a relational database (with enough tables).
A graph can avoid a lot of lookups in other tables by having a hard link from one entry to another, so in many/most cases I can see the speed advantage of a graph. If your data is naturally hierarchical, and especially if it forms a tree, I see the logical/reasoning benefit to a graph over relational. I imagine a node of a graph which links to other nodes probably contains multiple maps or lists... which is effectively containing a relational DB within nodes of a graph.
Are there any disadvantages to a graph db vs a relational db? (Note: I'm not looking to things like missing features in implementations, but instead the theoretical pros/cons)
When should I still use a relational database? Even if I logically have a single mapping of an int to int I could do it in a graph.
Graph databases were deprecated by relational-ish technology some 20 to 30 years ago.
The major theoretical disadvantage is that graph databases use TWO basic concepts to represent information (nodes and edges), whereas a relational database uses only one (the relation). This bleeds over into the language for data manipulation, in that a graph-based language must provide two distinct sets of operators : one for operating on nodes, and one for operating on edges. The relational model can suffice with only one.
More operators means more operators to implement for the DBMS builder, more opportunity for bugs, and for the user it means more distinct language constructs to learn. For example, adding information to a database is just INSERT in relational, in graph-based it can be either STORE (nodes) or CONNECT (edges). Removing information is just DELETE (relational), as opposed to either ERASE (nodes) or DISCONNECT (edges).
Building on Erwin Smout's fine answer, an important reason why the relational model supplanted the graph one is that a graph has a greater degree of "bias" baked into its structure than relations do. The edges of a graph are navigational links which user queries are expected to traverse in a particular way. A relational model of the same data assumes much less about how the data will be used. Users are free to join and manipulate relational data in ways that the database designer might not have foreseen. The disruptive costs of re-engineering graph database structures to support new requirements were a factor which drove the adoption of the relational model and its SQL-based offshoots in the 1980s.
Relational databases were designed to aggregate data, graph to find relations.
If You have for example a financial domain, all connections are known, You only aggregate data by other data to find sums and so on.
Graph databases are better in more chaotic domain where to connections are more important, and not all connections are apparent, for example:
networks of people, with different relations with one and other
films and people creating them. Not just actors but the whole crew.
natural language processing and finding connections between recognized words
Data model is important, but what matters more is how you access your data. Notice, there are very few (none, actually) sharded or otherwise distributed graph databases out there. If you compare insertion speed into a typical relational database and a graph database, your relational database will most likely win.
Yes, graph model is more versatile than relational model, but it doesn't make it universal - in some cases, this versatility is a roadblock for optimizations.
In fact, modern graph databases are a niche solutions for a narrow set of tasks - finding a route from A to B, working with friends in a social network, information technology in medicine.
For most business applications relational databases continue to prevail.
I'm missing the performance aspect in the answers above.
Performance of graph based data bases is inherently worse for scalar and maybe even tree based models. Only if you have a real graph, they may exhibit better performance.
Also most graph DBs do not feature ACID support such as almost any RDBMS.
From my real life experience I can tell almost any evolving data model will sooner or later become a graph and that's why graph DBs are superior in terms of flexibility and agility (they keep pace with the evolution of your data model).
That's why I don't think that RDBs will prevail for "For most business applications" as #Kostja says. I think they will prevail where ACID capability is essential.

How to handle different organization with single DB?

Background
building an online information system which user can access through any computer. I don't want to replicate DB and code for every university or organization.
I just want user to hit a domain like www.example.com sign in and use it.
For second user it will also hit the same domain www.example.com sign in and use it. but the data for them are different.
Scenario
suppose a university has 200 employees, 2nd university has 150 and so on.
Qusetion
Do i need to have separate employee table for each university or is it OK to have a single table with a column that has University ID?
I assume 2nd is best but Suppose i have 20 universities or organizations and a total of thousands of employees.
What is the best approach?
This same thing is for all table? This is just to give you an example.
Thanks
The approach will depend upon the data, usage, and client requirements/restrictions.
Use an integrated model, as suggested by duffymo. This may be appropriate if each organization is part of a larger whole (i.e. all colleges are part of a state college board) and security concerns about cross-query access are minimal2. This approach has a minimal amount of separation between each organization as the same schema1 and relations are "openly" shared. It leads to a very simple model initially, but it can become very complicated (with compound FKs and correct usage of such) if needing relations for organization-specific values because it adds another dimension of data.
Implement multi-tenancy. This can be achieved with implicit filters on the relations (perhaps hidden behinds views and store procedures), different schemas, or other database-specific support. Depending upon implementation this may or may not share schema or relations even though all data may reside in the same database. With implicit isolation, some complicated keys or relationships can be hidden/eliminated. Multi-tenancy isolation also generally makes it harder/impossible to cross-query.
Silo the databases entirely. Each customer or "organization" has a separate database. This implies separate relations and schema groups. I have found this approach to to be relatively simple with automated tooling, but it does require managing multiple database. Direct cross-querying is impossible, although "linked databases" can be used if there is a need.
Even though it's not "a single DB", in our case, we had the following restrictions 1) not allowed to ever share/expose data between organizations, and 2) each organization wanted their own local database. Thus, our product ended up using a silo approach. Make sure that the approach chosen meets customer requirements.
None of these approaches will have any issue with "thousands", "hundreds of thousands", or even "millions" of records as long as the indices and queries are correctly planned. However, switching from one to another can violate many assumed constraints and so the decision should be made earlier on.
1 In this response I am using "schema" to refer to the security grouping of database objects (e.g. tables, views) and not the database model itself. The actual database model used can be common/shared, as we do even when using separate databases.
2 An integrated approach is not necessarily insecure - but it doesn't inherently have some of the built-in isolation of other designs.
I would normalize it to have UNIVERSITY and EMPLOYEE tables, with a one-to-many relationship between them.
You'll have to take care to make sure that only people associated with a given university can see their data. Role based access will be important.
This is called a multi-tenant architecture. you should read this:
http://msdn.microsoft.com/en-us/library/aa479086.aspx
I would go with Tenant Per Schema, which means copying the structure across different schemas, however, as you should keep all your SQL DDL in source control, this is very easy to script.
It's easy to screw up and "leak" information between tenants if doing it all in the same table.

Does SQL Server have native support for EAV

I love the flexible schema capabilities of CouchDB and MongoDB, but I also love the relational 'join' capability of SQL Server. What I really want is the ability to have tables such as PERSON, COMPANY and ORDER that are basically 'open-schema' where each table has an ID but the rest of the columns are defined json-style {ID:12,firstname:"Pete",surname:"smith",height:"180"}, but where I can efficiently join PERSON to COMPANY either directly or via a many-to-many xref table. Does anyone know if SQL Server has any plans to incorporate 'open schema' in SQL, or whether Mongo or Couch have plans to support efficient joining? Thanks very much.
CouchDB offers a number of ways to establish relationships between your various documents/entities. Check out this article on the wiki to get started.
The tendency, when coming from a relational background, is to continue using the same terminology and mindset whenever you try to solve problems. It's very important to understand that NoSQL solutions are very different, otherwise they have no real purpose for existing. You should really seek to understand how these various NoSQL solutions work so you can compare them with your application's requirements to see if it's an appropriate fit.
MongoDB = NoSQL = No Joins - never ever.
If you need JOINs due to your data model or project requirements: stay with a RDBMS.
Alternatives in MongoDB:
denormalization
using embedded documents
multiple queries
As much as this would be inefficient to Query on a large scale, from a technical standpoint, using the XML datatype would allow you to store whatever structure you wanted that can vary by row.
Not that I'm aware of, but it's not that hard to role your own EAV, it's only 3 tables after all :)
Entity stores the associated table name.
Attribute stores the column name, data type and whether it's nullable.
Value contains one nullable column for each required data type.
Entity 1..* Attribute 1..* Values
Assuming you're using .NET, define your EAV interfaces, create some POCO's and let Entity Framework or your ORM of choice wire up the associations for you. LINQ works great for this sort of operation.
It also allows you to create a hybrid model, where parts of the schema are known but you still want flexibility for custom data. If you design your domain model with this in mind (i.e. use the EAV interfaces in your model) the EAV can be baked in to the EF data context (or whatever) to automate the loading of attributes and their values for each entity. Your EF entity just needs to know which table entity it belongs to.
Of course it's not the perfect solution, as you're (potentially) trading performance for flexibility. Depending on the amount of data you want to persist and the performance requirements, it may be more suited to models where most of the schema is known and a smaller percentage is unknown. YMMV.

Using LDAP server as a storage base, how practical is it?

I want to learn how practical using an LDAP server (say AD) as a storage base. To be more clear; how much does it make sense using an LDAP server instead of using RDBMS to store data?
I can guess that most you might just say "it doesn't" but there might be some reasons to make it meaningful (especially business wise);
A few points first;
Each table becomes a container entity and each row becomes a new entity as a child. Row entities contains attributes for columns. So you represent your data in this way. (This should be the most meaningful representation I think, suggestions are welcome)
So storing data like a DB server is possible but lack of FK and PK (not sure about PK) support is an issue. On the other hand it supports attribute (relates to a column) indexing (Not sure how efficient). So consistency of data is responsibility of the application layer.
Why would somebody do this ever?
Data that application uses/stores closely matches with the existing data in AD. (Users, Machines, Department Info etc.) (But still some customization is required to existing entity schema, and new schema definitions are needed for not very much related data.)
(I think strongest reason would be this: business related) Most mid-sized companies have very well configured AD servers (replicated, backed-up etc.) but they don't have such DB setup (you can make comment to this as much as you want). Say when you sell your software which requires a DB setup to these companies, they must manage their DB setup; but if you say "you don't need DB setup and management; you can just use existing AD", it sounds appealing.
Obviously there are many disadvantages of giving up using DB, feel free to mention them but let's assume they are acceptable. (I can mention more if question is not clear enough.)
LDAP is a terrible tool for maintaining most business data.
Think about a typical one-to-many relationship - say, customer and orders. One customer has many orders.
There is no good way to represent this data in an LDAP directory.
You could try having a mock "foreign key" by making every entry of that given object class have a "foreign key" attribute, but your referential integrity just went out the window. Cascade deletes are impossible.
You could try having a "customer" object that has "order" children. However, you've just introduced a specific hierachy - you're now tied to it.
And that's the simplest use case. Once you start getting into more complex relationships, you're basically re-inventing an RDBMS in a system explicity designed for a different purpose. The clue's in the name - directory.
If you're storing a phonebook, then sure, use LDAP. For anything else, use a real database.
For relatively small, flexible data sets I think an LDAP solution is workable. However an RDBMS provides a number significant advantages:
Backup and Recovery: just about any database will provide ACID properties. And, RDBMS backups are generally easy to script and provide several options (e.g. full vs. differential). Just don't know with LDAP, but I imagine these qualities are not as widespread.
Reporting: AFAIK LDAP doesn't offer a way to JOIN values easily, much the less do things like calculate summations. So you would put a lot of effort into application code to reproduce those behaviors when you do need reporting. And what application doesn't ultimately?
Indexing: looks like LDAP solutions have indexing, but again, seems hit or miss. Whereas seemingly all databases out there have put some real effort into getting this right.
I think any serious business system's storage should be backed up in the same fashion you believe LDAP is in most environments. If what you're really after is its flexibility in terms of representing hierarchy and ability to define dynamic schemas I'd suggest looking into NoSQL solutions or the Java Content Repository.
LDAP is very usefull for storing that information and if you want it, you may use it. RDMS is just more comfortable with ORM systems. Your persistence logic with LDAP will so complex.
And worth mentioning that this is not a standard approach -> people who will support the project will spend more time on analysis.
I've used this approach for fun, i generate a phonebook from Active Directory, but i don`t think that it's good idea to use LDAP as a store for business applications.
In short: Use the right tool for the right job.
When people see LDAP you already set an expectation on your system. Don't forget that the L Lightweight. LDAP was designed for accessing directories over a network.
With a “directory database” you can build a certain type of application. If you can map your data to a tree like data structure it will work. I surely would not want to steam videos from LDAP! You can probably hack something but I would prefer a steaming server..
There might be some hidden gotchas down the line if you use a tool not designed for what it is supposed to do. So, the downside is you'll have to test stuff that would have been a given in some cases.
It's not is not just a technical concern. Your operational support team might “frown” on your application as they would have certain expectations/preconceptions based on your applications architectural nature. Imagine their surprise if you give them CRM system (website + files and popped email etc.) in a LDAP server as database to maintain.
If I was in your position, I would steer towards one of the NoSQL db solutions rather than trying to use LDAP. LDAP is fine for things like storing user and employee information, but is terrible to interact with when you need to make changes. A NoSQL db will allow you to store your data how you want without the RDBMS overhead you would like to avoid.
The answer is actually easy. Think of CRUD (Create, Read, Update, Delete). If a lot of Read will be made in your system, you can think of using LDAP. Because LDAP is quick in read operations and designed so. If the other operations will be made more, the RDMS would be a better option.

Database alternatives?

I was wondering the trade-offs for using databases and what the other options were? Also, what problems are not well suited for databases?
I'm concerned with Relational Databases.
The concept of database is very broad. I will make some simplifications in what I present here.
For some tasks, the most common database is the relational database. It's a database based on the relational model. The relational model assumes that you describe your data in rows, belonging to tables where each table has a given and fixed number of columns. You submit data on a "per row" basis, meaning that you have to provide a row in a single shot containing the data relative to all columns of your table. Every submitted row normally gets an identifier which is unique at the table level, sometimes at the database level. You can create relationships between entities in the relational database, for example by saying that a given cell in your table must refer to another table's row, so to preserve the so called "referential integrity".
This model works fine, but it's not the only one out there. In some cases, data are better organized as a tree. The filesystem is a hierarchical database. starts at a root, and everything goes under this root, in a tree like structure. Another model is the key/value pair. Sleepycat BDB is basically a store of key/value entities.
LDAP is another database which has two advantages: stores rather generic data, it's distributed by design, and it's optimized for reading.
Graph databases and triplestores allow you to store a graph and perform isomorphism search. This is typically needed if you have a very generic dataset that can encompass a broad level of description of your entities, so broad that is basically unknown. This is in clear opposition to the relational model, where you create your tables with a very precise set of columns, and you know what each column is going to contain.
Some relational column-based databases exist as well. Instead of submitting data by row, you submit them by whole column.
So, to answer your question: a database is a method to store data. Technically, even a text file is a database, although not a particularly nice one. The choice of the model behind your database is mostly relative to what is the typical needs of your application.
Setting the answer as CW as I am probably saying something strictly not correct. Feel free to edit.
This is a rather broad question, but databases are well suited for managing relational data. Alternatives would almost always imply to design your own data storage and retrieval engine, which for most standard/small applications is not worth the effort.
A typical scenario that is not well suited for a database is the storage of large amounts of data which are organized as a relatively small amount of logical files, in this case a simple filesystem-like system can be enough.
Don't forget to take a look at NOSQL databases. It's pretty new technology and well suited for stuff that doesn't fit/scale in a relational database.
Use a database if you have data to store and query.
Technically, most things are suited for databases. Computers are made to process data and databases are made to store them.
The only thing to consider is cost. Cost of deployment, cost of maintenance, time investment, but it will usually be worth it.
If you only need to store very simple data, flat files would be an alternative (text files).
Note: you used the generic term 'database', but there are many many different types and implementations of these.
For search applications, full-text search engines (some of which are integrated to traditional DBMSes, but some of which are not), can be a good alternative, allowing both more features (various linguistic awareness, ability to have semi-structured data, ranking...) as well as better performance.
Also, I've seen applications where configuration data is stored in the database, and while this makes sense in some cases, using plain text files (or YAML, XML and such) and loading the underlying objects during initialization, may be preferable, due to the self-contained nature of such alternative, and to the ease of modifying and replicating such files.
A flat log file, can be a good alternative to logging to DBMS, depending on usage of course.
This said, in the last 10 years or so, the DBMS Systems, in general, have added many features, to help them handle different forms of data and different search capabilities (ex: FullText search a fore mentioned, XML, Smart storage/handling of BLOBs, powerful user-defined functions, etc.) which render them more versatile, and hence a fairly ubiquitous service. Their strength remain mainly with relational data however.

Resources