Expressing Relationships between Entities Informally - data-modeling

What are some of the different ways/illustrations people use to convey relationships between entities? I'm looking to quickly express relationships between different entities in a sub-system without getting too complicated: keeping it high level but conveying the relationships clearly. I've heard of and used ERDs in the past, but I'm looking for an alternative that:
Is clear/concise
Can be quickly mocked, in say, an email
Conveys the relationships between entities clearly

Related

Modelling a graph like entity relationship in app engine

I have a project tracking application. The app has the following entities:
project
story - belong to a particular project
user - belong to a particular project, assigned to a particular story
Each project can have multiple story and user entities as descendants. Each story can be parent to multiple user entities. Basically, every project has several users that can work on the various stories (tasks) within the project. Each story can have multiple users assigned to it as well. Something like below:
Now my question is, can I model such a relationship in the App Engine datastore using ancestor queries without causing an index explosion? For example, I can find the stories within a project with a simple query. But finding the stories assigned to a specific user would require traversing the entire story index (which isn't really an issue, since query performance is independent of index table size). Would it not be better to have the query reflect a graph-like relationship here, as if modeled using a graph database like Neo4j?
If a user can work on multiple stories, or none at the moment, and/or can ever change the story they're working on (get assigned to a different story later), then modeling the story as "parent" to the user seems deeply wrong on a semantic level -- it may also cause performance issues (depending on kind of queries, frequency of reads and writes, etc, etc), but, that's quite a secondary worry, I'd focus on the semantic correctness first and I'm not entirely sure about the specific semantics of your data model.
A parent "relationship", in GAE's datastore, is intended to model a persistent (actually I'd say "permanent", in terms of the child entity's lifetime:-) 1:many connection -- especially one that may well require transactional behavior (or even just strong consistency) among parent and child, or among siblings (transactional behavior and strong consistency don't come for free, performance-wise -- but, when you need them, you really need them:-). How well does the connection between story and user in your app match this summary?
There are of course other ways you can model persistent 1:many connections; using ndb concepts, a StructuredProperty can in fact let you embed the "child" entity "inside" the "parent" one (and if you don't need queries on the child's attributes you can gain a speed boost by using the local kind of structured property).
And of course, the most general way to model any kind of relationship is with KeyProperty -- that doesn't require the relationship to be persistent/permanent, nor necessarily 1:many (e.g. if a user can be assigned to multiple stories). In fact you can view key properties as edges in a directed graph where entities are the nodes, with full generality (indeed it can be a directed multigraph, with 0+ edges from node A to node B, if you need even more generality than a "mere" graph can provide:-). But of course you can pay some price for such broad generality, if you don't really need it nor use it.
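To make the "key properties as edges" view concrete, here is a minimal sketch in plain Python. It deliberately does not use the ndb API; the `Story` class, its fields, and the string keys are hypothetical stand-ins for datastore entities and keys. The `assignee_keys` list plays the role that a repeated key property would play.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Story:
    # All names here are illustrative; plain strings stand in for datastore keys.
    key: str
    project_key: str                                         # edge: story -> project
    assignee_keys: List[str] = field(default_factory=list)   # edges: story -> users


def stories_for_user(stories, user_key):
    """Follow the story -> user edges backwards: which stories reference this user?"""
    return [s.key for s in stories if user_key in s.assignee_keys]


stories = [
    Story("s1", "p1", ["u1", "u2"]),
    Story("s2", "p1", ["u2"]),
    Story("s3", "p1"),
]

# Assigning a user to another story only adds an edge; no entity changes parent.
stories[2].assignee_keys.append("u1")

print(stories_for_user(stories, "u1"))
print(stories_for_user(stories, "u2"))
```

Because the user is referenced rather than parented, a user can sit on zero, one, or many stories, which is the many-to-many case discussed above.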
In the end, beyond complete clarity in the entity-relationship modeling (which is a good thing no matter what kind of db is underlying:-), the choice of "schema" (in the broadest sense of the word:-) for a NoSQL database is strongly dependent on what queries, updates, &c, the app will require, with what frequency, tolerance for latency, transactionality requisites, consistency (strong vs eventual), ... to a higher degree than for the relational databases that are what I think most of us "cut our teeth on". Thus I would encourage you to strive to make both aspects very explicit -- the E-R layer of abstraction, of course, but also the mix of queries, updates, &c, and the constraints and desiderata on them.

Many-to-many relationship ERD

Good day,
Real estate companies have several Buildings. Each Building is managed by one or more Managers, and Managers have access to one or more Buildings. So there is a many-to-many relationship between Managers and Buildings. There has to be a table such as Permissions to resolve the many-to-many relationship.
Please help me figure out the best design for the database.
I came up with two candidate diagrams. Which one is better? If neither of them is good, what should I change?
http://i.stack.imgur.com/Z0l6h.png
http://i.stack.imgur.com/Dg5Sv.png
Sincerely
The second picture seems closest.
I'd suggest moving the boxes around a little to show the hierarchy. Put Companies top and center, then on the next row, Managers on the left, Buildings on the right and Permissions between those two.
ER diagrams are used for two different purposes. One purpose is to illustrate the subject matter entities, and the relationships between them, as understood by subject matter experts. This is called a conceptual model of the data.
The other purpose is to illustrate a proposed database design, one where the relationships are not only expressed, but also implemented somehow. If the design is relational (which it usually is) many-to-many relationships are expressed by creating an intermediate table. This is called a physical model of the data (in some literature it's called a logical model). This is what you have done in your second diagram.
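As a rough illustration of the intermediate-table approach, the sketch below uses Python's built-in `sqlite3` module. The table and column names (`managers`, `buildings`, `permissions`, `address`) and the sample rows are assumptions made for the example, and the Companies table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE managers  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE buildings (id INTEGER PRIMARY KEY, address TEXT NOT NULL);
    -- Intermediate table: one row per (manager, building) pair.
    CREATE TABLE permissions (
        manager_id  INTEGER NOT NULL REFERENCES managers(id),
        building_id INTEGER NOT NULL REFERENCES buildings(id),
        PRIMARY KEY (manager_id, building_id)
    );
    INSERT INTO managers  VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO buildings VALUES (10, '1 Main St'), (20, '2 Oak Ave');
    INSERT INTO permissions VALUES (1, 10), (1, 20), (2, 20);
""")


def buildings_for(manager_name):
    """Buildings a manager has access to, via the permissions table."""
    rows = conn.execute(
        """SELECT b.address FROM buildings b
           JOIN permissions p ON p.building_id = b.id
           JOIN managers m ON m.id = p.manager_id
           WHERE m.name = ? ORDER BY b.id""",
        (manager_name,),
    )
    return [address for (address,) in rows]


def managers_for(address):
    """Managers with access to a building: the same table read the other way."""
    rows = conn.execute(
        """SELECT m.name FROM managers m
           JOIN permissions p ON p.manager_id = m.id
           JOIN buildings b ON b.id = p.building_id
           WHERE b.address = ? ORDER BY m.id""",
        (address,),
    )
    return [name for (name,) in rows]


print(buildings_for("Ann"))
print(managers_for("2 Oak Ave"))
```

The composite primary key keeps each manager/building pair unique, and either side of the relationship can be queried with the same two joins.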
Your first diagram could be cleaned up a little by eliminating the box named "permissions", and putting a crows-foot at both ends of the line connecting Managers and Buildings.
Now to come back to your question: which one is "better"? It depends. Sometimes a conceptual diagram is better for discussing the subject matter with the ultimate stakeholders: non-technical managers who work with the data all the time, and might be called "subject matter experts".
A physical diagram is usually better when discussing the proposed design among data architects and programmers. It explains not only how the data works in concept, but also how the database is to be built. This kind of detail is glossed over by a conceptual model.
So you may end up with two diagrams, and use the appropriate one depending on your audience.

When should I use a Column Family NoSQL solution vs Key-Value, Document Store, Graph

I understand the technical differences between the different solutions. But I can't seem to find concrete examples of the pros/cons of the different types of NoSQL solutions, and when to use one type over the other.
All of the information I find online gives very vague suggestions of when to use one type vs the other. And they all seem to be able to be interchangeably used without a clear indication of the advantage of using one over the other.
Document-oriented
Examples: MongoDB, CouchDB
Strengths: Heterogeneous data, working object-oriented, agile development
Their advantage is that they do not require a consistent data structure. They are useful when your requirements, and thus your database layout, change constantly, or when you are dealing with datasets which belong together but still look very different. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
Graph databases
Examples: Neo4j, GiraffeDB
Strengths: Data Mining
Their focus is on defining data by its relation to other data. When you have a lot of tables with primary keys which are the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Key-Value Stores
Examples: Redis, Cassandra, MemcacheDB
Strengths: Fast lookup of values by known keys
They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
Unfortunately they assume that you know exactly what you are looking for. You need the profile of User157641? No problem, will only take microseconds. But what if you want the names of all users who are aged between 16 and 24, have "waffles" as their favorite food and logged in during the last 24 hours? Tough luck. When you don't have a definite and unique key for a specific result, you can't get it out of your K-V store that easily.
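The difference can be sketched with a plain Python dict standing in for a key-value store. The profile fields, the sample users, and the fixed "current time" are made up for the example; a real store would add its own API on top.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 2, 12, 0)  # fixed "current time" so the example is repeatable

# A dict as a stand-in for a key-value store: opaque values behind unique keys.
store = {
    "User157641": {"name": "Alice", "age": 19, "favorite_food": "waffles",
                   "last_login": now - timedelta(hours=3)},
    "User157642": {"name": "Bob", "age": 31, "favorite_food": "waffles",
                   "last_login": now - timedelta(hours=1)},
    "User157643": {"name": "Carol", "age": 22, "favorite_food": "waffles",
                   "last_login": now - timedelta(days=3)},
}

# Known key: a single direct lookup.
profile = store["User157641"]


def young_waffle_fans(store, now):
    """No key describes this result, so every value has to be inspected."""
    cutoff = now - timedelta(hours=24)
    return [
        p["name"]
        for p in store.values()
        if 16 <= p["age"] <= 24
        and p["favorite_food"] == "waffles"
        and p["last_login"] >= cutoff
    ]


print(profile["name"])
print(young_waffle_fans(store, now))
```

The lookup touches one entry regardless of store size, while the predicate query walks every entry.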
There is an excellent article describing the types of NoSQL databases and when to use each. Reading it will give you a good understanding.

Why are relational sets important?

A friend is developing a website, and has to make a database using SQL. He asks why you need "has-a" or "is-a" relationships, since you can take the primary keys of one entity set and place them in the other appropriate entity set (and vice versa) to find the relations.
I could not answer the question because I was just taught that relational sets are just how databases work.
Edit: I did not want to go into normalization. He made a point that the information is replicated in the relationship set.
Your question mixes two different levels of abstraction together, namely the conceptual level and the logical level.
At the conceptual level, one is interested in describing the information requirements on the proposed database. It's useful to do this without tilting the description towards one solution or another. One model that is useful for this purpose is the Entity-Relationship (ER) model. In this model, the subject matter is broken down into entities (subjects) and relationships among those entities. All data is seen as describing some aspect of one of the entities or one of the relationships.
"Is-a" and "has-a" relationships are relevant at this level of abstraction. At this level, relationships are identified, but not implemented.
After creating a conceptual model of the database, but before creating the database itself, it's useful to go through a logical design phase, resulting in a logical model of the database. If the database is to be relational, it's useful to make the logical model a relational one. The relational model is the next level of abstraction.
This is where primary keys and foreign keys come in. These keys implement the relationships that were identified at the conceptual stage. This is how the relational model implements relationships. At this stage, you get involved with design issues like junction tables, table composition, and normalization.
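As a small sketch of how keys implement the two kinds of relationship, the example below uses Python's `sqlite3`. The `person`, `employee`, and `phone` tables are hypothetical, chosen only to show one "is-a" and one "has-a" relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- "is-a": every employee is a person, so the primary key is also a foreign key.
    CREATE TABLE employee (
        person_id INTEGER PRIMARY KEY REFERENCES person(id),
        salary    INTEGER NOT NULL
    );
    -- "has-a": a person has zero or more phone numbers.
    CREATE TABLE phone (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(id),
        number    TEXT NOT NULL
    );
    INSERT INTO person   VALUES (1, 'Dana');
    INSERT INTO employee VALUES (1, 50000);
    INSERT INTO phone    VALUES (1, 1, '555-0100'), (2, 1, '555-0101');
""")


def phones_of(name):
    """Resolve the "has-a" relationship by joining on the foreign key."""
    rows = conn.execute(
        """SELECT ph.number FROM phone ph
           JOIN person p ON p.id = ph.person_id
           WHERE p.name = ? ORDER BY ph.id""",
        (name,),
    )
    return [number for (number,) in rows]


def try_insert_orphan_phone():
    """A phone row pointing at a missing person is rejected by the constraint."""
    try:
        conn.execute("INSERT INTO phone VALUES (3, 99, '555-0199')")
    except sqlite3.IntegrityError:
        return False
    return True


print(phones_of("Dana"))
print(try_insert_orphan_phone())
```

The relationships themselves were decided at the conceptual stage; the keys are only the mechanism that records and enforces them.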
In addition to the conceptual level and the logical level, there are the physical level and the script level. But these are outside the scope of your question.
The two kinds of relationships are features of the problem to be solved. Foreign key references to primary keys are features of the proposed solution.

NoSQL / RDBMS hybrid with referential integrity (delete cascade)?

Is there a database out there that gives you the benefit of referential integrity and being able to use a SQL type language for querying, but also lets entities be loosely defined with respect to their data attributes and also the relationships between them?
E.g. take a RBAC type model where you have Permissions, Users, User Groups & Roles. A complex/flexible model could have the following rules:
Roles can have one or more permissions and a permission can belong to one or more Roles
Users can have one or more permissions and a permission can belong to one or more Users
User Groups can have one or more permissions and a permission can belong to one or more User Groups
Users can have one or more roles and a role can belong to one or more Users
User Groups can have one or more roles and a role can belong to one or more User Groups
Roles can have one or more roles and a role can belong to one or more Roles
To model the above in an RDBMS would involve the creation of lots of intersection tables. Ideally, all I'd like to define in the database is the entities themselves (User, Role, etc) plus some mandatory attributes. Everything else would then be dynamic (i.e. no DDL required), e.g. I could create a User with a new attribute which wasn't pre-defined. I could also create a relationship between entities that hasn't been predefined, though the database would handle referential integrity like a normal RDBMS.
The above can be achieved to some degree in a RDBMS by creating a table to store entities and another one to store relationships etc, but this overly complicates the SQL needed to perform simple queries and may also have performance implications.
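For reference, the generic entity/relationship layout mentioned above might look like the following sketch, written with Python's `sqlite3`. The table names (`entity`, `relationship`), the `label` column, and the sample data are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
conn.executescript("""
    CREATE TABLE entity (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        name TEXT NOT NULL
    );
    -- One generic table holds every relationship, whatever the entity kinds.
    CREATE TABLE relationship (
        from_id INTEGER NOT NULL REFERENCES entity(id) ON DELETE CASCADE,
        to_id   INTEGER NOT NULL REFERENCES entity(id) ON DELETE CASCADE,
        label   TEXT NOT NULL,
        PRIMARY KEY (from_id, to_id, label)
    );
    INSERT INTO entity VALUES
        (1, 'User', 'alice'), (2, 'Role', 'admin'), (3, 'Permission', 'delete_post');
    INSERT INTO relationship VALUES (1, 2, 'has_role'), (2, 3, 'grants');
""")


def permissions_via_roles(user_name):
    """Even a simple question needs two self-joins on the relationship table."""
    rows = conn.execute(
        """SELECT p.name FROM entity u
           JOIN relationship ur ON ur.from_id = u.id AND ur.label = 'has_role'
           JOIN relationship rp ON rp.from_id = ur.to_id AND rp.label = 'grants'
           JOIN entity p ON p.id = rp.to_id
           WHERE u.name = ? ORDER BY p.name""",
        (user_name,),
    )
    return [name for (name,) in rows]


def relationship_count():
    return conn.execute("SELECT COUNT(*) FROM relationship").fetchone()[0]


perms_before = permissions_via_roles("alice")
count_before = relationship_count()
conn.execute("DELETE FROM entity WHERE id = 2")  # drop the 'admin' role
count_after = relationship_count()               # both rows touching it are cascaded away

print(perms_before, count_before, count_after)
```

New relationship types need no DDL here, only new `label` values, but the query shows the price: the SQL no longer reads like the domain.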
Most NoSQL databases are built to scale very well. This is done at the cost of consistency, of which referential integrity is a part. So most NoSQL databases don't support any type of relational constraints.
There's one type of NoSQL database that does support relations. In fact, it's designed especially for relations: the graph database. Graph databases store nodes and explicit relations (edges) between these nodes. Both nodes and edges can contain data in the form of key/value pairs, without being tied to a predefined schema.
Graph databases are optimized for relational queries and nifty graph operations, such as finding the shortest path between two nodes, or finding all nodes within a given distance from the current node. You wouldn't need this in a role/permission scenario, but if you do, it'll be a lot harder to achieve using an RDBMS.
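One traversal that may still come up in the role/permission example is following nested roles to collect a user's effective permissions. The sketch below does this with a breadth-first walk over a plain Python adjacency dict; the node naming scheme (`user:`, `role:`, `perm:` prefixes) and the sample edges are invented for the example and do not reflect any particular graph database's API.

```python
from collections import deque

# Directed edges: "has role", "includes role" or "grants permission".
edges = {
    "user:alice": ["role:editor", "perm:comment"],
    "role:editor": ["role:viewer", "perm:edit"],
    "role:viewer": ["perm:read"],
    "group:staff": ["role:viewer"],
}


def effective_permissions(start):
    """Breadth-first walk from a node, collecting every reachable permission."""
    seen = {start}
    queue = deque([start])
    perms = set()
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target in seen:
                continue  # guards against cycles such as roles including each other
            seen.add(target)
            if target.startswith("perm:"):
                perms.add(target)
            else:
                queue.append(target)
    return sorted(perms)


print(effective_permissions("user:alice"))
print(effective_permissions("group:staff"))
```

In an RDBMS the same question needs a recursive query or repeated joins, since the nesting depth of roles is not known in advance.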
Another option is to make your entire data layer a hybrid, by using a RDBMS to store the relations and a document database to store the actual data. This would complicate your application slightly, but I don't think it's such a bad solution. You'll be using two different technologies, both dealing with the problems they were designed to deal with.
Given the requirements you specify in your question, a graph database is probably the sort of thing you are looking for, but there are other options. As @Niels van der Rest said, the two constraints of "no a priori schema" and "referential integrity" are very hard to reconcile. You might be able to find a Topic-Map based database that might do so, but I'm not familiar with specific implementations so I couldn't say for sure.
If you decide you really can't do without referential integrity, I fear you probably are stuck with an RDBMS. There are some tricks you can use that might avoid some of the problems you anticipate; I cover a couple in https://stackoverflow.com/questions/3395606..., which might give you some ideas. Still, for this sort of data model, requiring a dynamic, a posteriori schema with meta-schema elements, an RDBMS is always going to be awkward.
If you are willing to forgo referential integrity, then you still have three approaches to consider.
Map/Reduce - in two flavours: distributed record-oriented (think, MongoDB), and column-oriented (think, Cassandra). Scales really well, but you won't have your SQL-like syntax; joins suck; and matching your architecture to your specific query types is critical. In your case your focus is on the entities and their attributes, rather than the relationships between the entities themselves, so I would probably consider a distributed record-oriented store, but only if I expected to need to scale beyond a single node.
Document-store - technically in two flavours, but one of them is the distributed record-oriented map/reduce datastore discussed above. The other is an inverted-index (think, Lucene/Solr). Do NOT disregard the power of an inverted-index; they can resolve obscenely complex record predicates amazingly fast. What they can't handle well is queries that include correlation or large relational joins. Still, you will be amazed at the flexibility that sufficiently complex record predicates give you.
Graph-store - comes in a few flavours: the first is the large-scale, ad-hoc key-value store (think, DBM/TokyoTyrant); the second is the tuple-space (think, Neo4j); the third is the RDF database (think, Sesame/Mulgara). I have a soft spot for RDF, having helped develop Mulgara, so I am not the most objective commenter. Still, if your scalability constraints will permit you to use an RDF store, I find the inferencing permitted by RDF's denotational semantics (rare amongst NoSQL datastore options) invaluable.
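To give a feel for why an inverted index resolves record predicates quickly, here is a toy version in plain Python. It is only a sketch of the idea, not how Lucene or Solr are implemented; the record fields and values are made up.

```python
from collections import defaultdict

records = {
    1: {"kind": "User", "dept": "sales", "active": "yes"},
    2: {"kind": "User", "dept": "ops", "active": "yes"},
    3: {"kind": "Role", "dept": "sales", "active": "no"},
    4: {"kind": "User", "dept": "sales", "active": "no"},
}

# Inverted index: (field, value) -> set of ids of the records containing it.
index = defaultdict(set)
for record_id, fields in records.items():
    for field_name, value in fields.items():
        index[(field_name, value)].add(record_id)


def match_all(*terms):
    """AND of several (field, value) terms: intersect their posting sets."""
    sets = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


def match_any(*terms):
    """OR of several (field, value) terms: union of their posting sets."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return sorted(result)


print(match_all(("kind", "User"), ("dept", "sales")))
print(match_all(("kind", "User"), ("dept", "sales"), ("active", "yes")))
print(match_any(("dept", "ops"), ("kind", "Role")))
```

Each predicate is answered with set operations over precomputed id sets, without scanning the records. A join across records, by contrast, has no such precomputed set to lean on.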
Some NoSQL solutions support security and SQL. One of these is OrientDB. Its security system is (quite) well documented, and it also supports SQL.
There's the Gremlin language, supported by the Neo4j graph database. Regarding your example, have a look at "Access control lists the graph database way". There's also a web-based tool including a REST API to Neo4j and a Gremlin console; see neo4j/webadmin.
You may want to check out MongoDB. It is a document-based database and so has a flexible schema. It is awesome and worth the time to see if it would suit your needs.

Resources