Permissions on nested REST API resources that I decided to flatten - database

I'm currently trying to model a workout plan in the database, and it consists of:
workout plan
workout day
exercise routine
sets
My initial idea was to do something like this:
workout_plans/:id/workout_days/:id/exercise_routines/:id/sets/:id
I noticed that this is a really heavily nested structure, so I decided to flatten it:
workout_plans/:id/workout_days
workout_days/:id/exercise_routines
etc...
I've got a simple permission junction table that states which user has which permission (read, create, delete) on which workout plan. I've decided that I currently don't need a more granular permission level on the other entities, and that having a certain permission on a workout plan should automatically be transitive to its nested child entities.
So if I have a read permission on a workout plan, I should be able to read its workout days and so on.
Now my problem with the "flat" REST API approach is that I don't have the convenience of knowing immediately which workout plan a certain exercise_routine or set is part of. Thus I need to "join" my way up through the parents until I know which workout plan an entity is related to, and only then can I check the permission.
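For reference, roughly the tables I'm describing (a sketch only; names and types are illustrative):

-- Each level references its immediate parent.
CREATE TABLE workout_plans (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE workout_days (
    id              SERIAL PRIMARY KEY,
    workout_plan_id INT NOT NULL REFERENCES workout_plans(id)
);

CREATE TABLE exercise_routines (
    id             SERIAL PRIMARY KEY,
    workout_day_id INT NOT NULL REFERENCES workout_days(id)
);

CREATE TABLE sets (
    id                  SERIAL PRIMARY KEY,
    exercise_routine_id INT NOT NULL REFERENCES exercise_routines(id),
    reps                INT,
    weight              NUMERIC
);

-- Permission junction table: which user has which permission on which plan.
CREATE TABLE workout_plan_permissions (
    user_id         INT  NOT NULL,
    workout_plan_id INT  NOT NULL REFERENCES workout_plans(id),
    permission      TEXT NOT NULL CHECK (permission IN ('read', 'create', 'delete')),
    PRIMARY KEY (user_id, workout_plan_id, permission)
);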
So my question is: is there a better way of doing this? My first idea for solving it is to give every entity down the line a reference ID to the parent workout_plan, but this approach seems kind of fishy to me. Or maybe I should just stick with the nested structure? Thanks in advance for the help!

You've hit the main issue with handling "inherited" permissions in a traditional database-backed access control system. In my experience, joins are useful (and reasonably performant) if the nesting is only a single level deep and the tables are kept to a reasonable size. Once you hit multiple levels, or extremely large tables, the only real "solution" within the database I've found is to denormalize the data as you suggested, by including the other ID(s); but even this can become an issue as tables grow larger (and a massive pain if you need to add another nesting level at some point).
The unease you're feeling cuts to a core issue with using databases for access control: permissions are often inherited, inferred, or even recursive in nature, but traditional relational databases were not designed for easy or performant resolution of these structures, because permission relationships are actually graphs. In your case, you have a tree of permission relationships, from workout plan, to day, to routine and so forth. There are services (including one of my own) dedicated to resolving deeply nested relationships, which might be helpful once you've pushed denormalization to its limits.
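To make the trade-off concrete, here is a sketch of what the permission check looks like in each case, using the illustrative tables above:

-- Flattened API, normalized schema: "join your way up" from a set to its plan.
SELECT 1
FROM sets s
JOIN exercise_routines er ON er.id = s.exercise_routine_id
JOIN workout_days      wd ON wd.id = er.workout_day_id
JOIN workout_plan_permissions p ON p.workout_plan_id = wd.workout_plan_id
WHERE s.id = :set_id
  AND p.user_id = :user_id
  AND p.permission = 'read';

-- Denormalized alternative: every child row also carries workout_plan_id,
-- e.g. ALTER TABLE sets ADD COLUMN workout_plan_id INT REFERENCES workout_plans(id);
-- so the check collapses to a single join against the permission table.
SELECT 1
FROM sets s
JOIN workout_plan_permissions p ON p.workout_plan_id = s.workout_plan_id
WHERE s.id = :set_id
  AND p.user_id = :user_id
  AND p.permission = 'read';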

Related

Database structure when implementing a Slack style workspace/instance architecture

I'm working on an app that has a Slack style workspace architecture where the user can access the same function of the application under multiple "instances" (workspaces).
I'm going to continue with using Slack as an example to explain my issue.
When any action is taken in my application I need to validate that the user has the rights to perform an action on the specified resource and that the resource is within the same workspace as the user.
The first tables I create, such as Users, have a simple database relationship to the workspace, e.g. a WorkspaceId field in the Users table.
My issue is that as I create more tables that are "further" away, such as UserSettings (which might be a one-to-one relationship to the Users table), I now have to join to the Users record to get the workspace that the UserSettings record belongs to.
So now I am wondering whether it is worth adding a WorkspaceId value to all tables, since otherwise I will end up doing a lot of JOINs in my database to keep verifying that the user has permissions to that resource.
Looking for advice/architecture patterns which may help with the scenario.
I'm assuming your main concern with multiple JOIN statements is that query performance will suffer. Multiple JOIN statements don't always mean a query will be slow. Query performance depends on many factors: how large the dataset is, how well indexed it is, what database engine you use, and ultimately what the query plan is. You'll only end up with lots of JOIN statements if you decide to normalize the database that way. Full third normal form is rarely the right choice for a schema because of the potential performance impact; some duplication of data is generally okay, and the trade-off you are making is storage cost vs. query performance. To decide how to normalize the database there are many questions you should be asking; here are some that come to mind:
What type of queries do you expect to make?
How often will each type of query be made?
How often will the data change and can a cache be used?
Does a different storage technology better suit the use case?
Is some of the data small enough that it can be all in one table?
In my experience, designing a user management system usually ends up involving a cache or similar mechanism that gives fast access to a given user's permissions within an acceptable expiry window. This means you only query the database for a given user when the cache entry expires, and use the cache the majority of the time. This is also why many security and user systems don't apply settings changes immediately. The more granular and flexible the permissions you want to grant users, the more expensive the query is going to be because of the added complexity, at which point you can decide to denormalize the data or use a caching mechanism.
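As a hedged illustration of the denormalization being discussed (table and column names assumed, not taken from your schema), adding WorkspaceId to a "distant" table collapses the ownership check from a chain of joins to a single indexed lookup:

-- Join-based check: walk from UserSettings through Users to find the workspace.
SELECT 1
FROM UserSettings us
JOIN Users u ON u.Id = us.UserId
WHERE us.Id = :user_settings_id
  AND u.WorkspaceId = :workspace_id;

-- Denormalized check: UserSettings carries WorkspaceId directly.
-- ALTER TABLE UserSettings ADD COLUMN WorkspaceId INT NOT NULL;
-- CREATE INDEX ix_usersettings_workspace ON UserSettings (Id, WorkspaceId);
SELECT 1
FROM UserSettings us
WHERE us.Id = :user_settings_id
  AND us.WorkspaceId = :workspace_id;

Either result can then be cached per user (with an acceptable expiry window) so that most requests never touch the database at all.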

Use Event Sourcing to get rid of relational database?

I've read some articles about CQRS and Event Sourcing recently. While the first seemed to me like a highly complex and risky workaround for poorly performing, poorly designed business and data access layers and data models, the latter seemed like a solution to many problems.
Problems to solve using Event Sourcing:
Get rid of the relational database and object-relational mappers, like NHibernate and Entity Framework. Hardly anybody on the programming side wants to pay attention to things like indices, table/index fragmentation or normalization, how to design relational data, and how to code/configure the ORM (a science of its own).
Have the business model and an in-memory "database" united: an entity/aggregate service keeping all relevant items in memory and maintaining integrity by simply dumping the CUD events somewhere, without much pain. Old items can be evicted from memory and dumped to a NoSQL (or whatever) store and used for aggregate calculations, reporting and search, and, if necessary, be re-activated. If I understand right, in-memory databases like VoltDB use event dumping in a similar way, but are still relational databases, separated from the business logic.
This would also make concurrency easier: instead of locking (with possible complete system deadlocks) or optimistic locking with a general "success or fail" logic, depending on whether the data has changed meanwhile (or rather complex DB code), merge rules can be implemented in code.
History: no more pain with implementing auditing functions, cemetery tables or "deleted" marker columns, or with possibly deleted data still being required.
Data Duplication/Search/Reporting: use full-text indices instead of chasing missing relational indices, create proper viewing areas, preparing the data for the user in a required format, instead of using ugly copy routines in relational databases, with triggers, followup stored procedures or even program code copying data to half a dozen different tables.
Versioning: it's a pain to get many modules running with a number of different relational database versions, each having different tables and columns and needs appropriate ORM mappings. Could be much easier in a single layer model, with the event dump accepting any object format (typically schema-less or loose-schema NoSQL documents, represented as JSON or XML). It might also be possible to upgrade old data through a "data schema change event" chain (instead of having to maintain migration scripts for relational DBs).
N-tier Business Model / Relational DB / ORM mess
An n-tier approach a decade or longer ago might have been a business layer and a data access layer. In order to keep the separation really strict, many relational features were omitted, to be implemented in the business layer instead: relational integrity, normalization, with the DB being what I call a "trash dump", looking like a kid had been playing around with SQL Server Management Studio or Access. Extremely un-normalized, polymorphic references ("Foreign Key" columns referencing different source tables, identified by a "ReferenceSource" marker), abuse of the same tables for different kinds of business objects, and duplication of data to numerous other tables (and from there again elsewhere), because performance wasn't good and this was supposed to improve queries. ORM usage was without object references too, reduced to single object load and save operations. Loading an aggregate (a graph of entities/table rows) would iterate through the graph and send a query for every set of sub-entities.
When performance got worse and, possibly, orphaned references caused serious trouble, attempts to implement classic relational design might have been made, but it was impossible to adapt the grown system to a complete data redesign (nobody would pay for it), hardly anybody would know how to map object relations or even optimized loading in the ORM. Such attempts were limited to a few places in the design, possibly making the data model and access even harder to maintain.
CQRS on top of n-tier?
To get acceptable performance, separate SQL queries were possibly built for certain modules, bypassing the business model with its single-object iterative access. This whole structure was suddenly called a de-facto CQRS, because of the separate query access (which could have been handled by a well-implemented relational data model and ORM usage, as long as it wasn't supposed to be a "big data", Google- or Stack Overflow-like workload) and the plentiful duplicated data in relational tables, made up for immediate application access.
Something better than the inappropriate table format?
OK, so I read into CQRS, and while I didn't like the use of "CQRS" as described before, the concept of an event storage instead of a relational DB looked very useful: it is unlikely you could successfully enforce the introduction of the original, state-of-the-art relational DB design and OR mapping, and even if you could, it would be extremely costly. In fact, ordinary object-oriented programming is much more "normalized" than most DB tables, due to the need to press everything into the table format or create tons of tables for object graphs/aggregates. And I agree: having to take care of search indices and defragmentation, schema management and data history tracking manually is like stone-age IT, like running Ford Model Ts and steam locomotives beside modern cars and electric high-speed trains.
Any good Experiences?
What are people's experiences with using event sourcing (not necessarily full CQRS)? Does it eliminate much of the pain of relational databases? I'm really looking forward to a kind of in-memory database with all business logic integrated, and possibly fast enough to make separate query modules dispensable!
There's a lot going on in this question and so a specific, actionable answer is not possible, but if you're looking for one then it is...
It depends on your domain.
CQRS/ES/DDD is not appropriate for solving every single problem - it is not a silver bullet. If the domain suggests that CRUD/NTier will be good enough, then that's what you should use. All of the concerns you list in your question are infrastructural or system traits and say nothing about the very thing that should inform your choice of tool or practice; what are you trying to build?
Although CQRS, ES, and DDD are very often used together they are separate concepts that are very powerful on their own.
CQRS (Command Query Responsibility Segregation): This is a very useful pattern for designing software in general. The idea is to keep things that change state (commands) separate from things that do not (queries). In many systems, queries modify the state of the database, which makes it very difficult for developers to reason about what is going on.
Imagine doing a query to find out some information and realizing that the information changed because you queried it.
CQRS prohibits that kind of behavior. Commands (which cannot return information) change state and Queries (which return information) cannot modify state. That way, you have certainty about which parts of the code are idempotent (and therefore can be called as much as you want with no side effects) and which parts change state.
DDD (Domain Driven Design): This is a Design methodology for the "Data Structure" of the code. It does not prescribe techniques for database access or many technical details. What it does is provide guidelines and concepts to structure data in an application in a way that makes it much more responsive to the actual user's needs. It also simplifies development (although it is more work than just slapping something together).
ES (Event Sourcing): Event sourcing is a data storage strategy which shifts data storage from state (the actual values of a piece of data at the current point in time) into transitions (the changes that have happened to a piece of data during its lifetime) which are called events.
There are several advantages of using ES.
First, it allows the business to store much more information regarding what happened before (a boon to Data Scientists). In traditional systems, a lot of information is lost to updates of the data, and unless those updates are explicitly logged, the information is gone forever. This does not happen in ES.
Second, storing all events makes debugging much simpler, because a developer can now follow the processing of the data from its beginning. An update to a piece of data that happened a long time ago (and would have been overwritten by a later update and lost), but that corrupted processing, can be identified and fixed. Furthermore, the effects of the fix can even be propagated across all calculations that happened between the wrong event and the last event. In a traditional system, this would be impossible, as we only store the latest state.
While it is theoretically possible to write an Event Sourced system without CQRS or DDD, it is remarkably more difficult to do so.
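For what it's worth, the storage side of event sourcing can be as simple as an append-only table. This is only a sketch (names invented, PostgreSQL syntax); real event stores add snapshots, subscriptions and so on:

-- Minimal append-only event store.
CREATE TABLE events (
    id             BIGSERIAL PRIMARY KEY,              -- global ordering
    aggregate_type TEXT        NOT NULL,               -- e.g. 'Order'
    aggregate_id   UUID        NOT NULL,               -- which entity the event belongs to
    version        INT         NOT NULL,               -- per-aggregate sequence number
    event_type     TEXT        NOT NULL,               -- e.g. 'OrderShipped'
    payload        JSONB       NOT NULL,               -- schema-less event body
    occurred_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (aggregate_id, version)                     -- optimistic concurrency: conflicting writers fail here
);

-- Current state is rebuilt by replaying one aggregate's events in order.
SELECT event_type, payload
FROM events
WHERE aggregate_id = :aggregate_id
ORDER BY version;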

Should this information be calculated in real time or stored in a separate database?

I am working on a group project and we are having a discussion about whether to calculate data that we want from an existing database and store it in a new database to query later, or calculate the data from the existing database every time we need to use it. I was wondering what the pros and cons may be for either implementation. Is there any advice you could give?
Edit: Here is a more elaborate explanation. We have a large database that has a lot of information being submitted to it daily. We are building a system to track certain points of data. For example, we are getting the count of how many times a user does something that is entered in the database. Using this example (our actual idea is a bit more complex), we are discussing two methods of getting the count of actions per user. The first method is to create a database that stores the users and their action count, and query this database every time we need the action count. The second method would be to query the large database and count the actions per user every time we need to use it. I hope this helps explain. Thoughts?
Edit 2: Two more things that may be useful to point out are: 1) I only have read access to the large database, and 2) my ultimate goal is to display this information on a web page for end users.
This is a generic question about optimization by caching. The following was my answer to essentially the same question. Even though that question provided a bunch of different details, none of them were specific enough to merit a non-generic answer either:
The more you want to calculate at query time, the more you want views, calculated columns and stored or user routines. The more you want to calculate at normalized base update time, the more you want cascades and triggers. The more you want to calculate at some other (scheduled or ad hoc) time, the more you use snapshots aka materialized views and updated denormalized bases. You can combine these. Any time the database is accessed it can be enabled by and restricted by stored routines or another API.
Until you can show that they are inadequate, views and calculated columns are the simplest.
The whole idea of a DBMS is to store a representation of your application state as the database (which normalization reduces the redundancy of), and then you query and let the DBMS implement and optimize calculation of the answer. You haven't presented a reason for not doing that in the most straightforward way possible.
Always make sure an application is reading its own personal ("external") database that is a view of "the" ("conceptual") database, so that when you change the implementation of the former (plus the rest of some combined interface) by the latter (plus the rest of some combined mechanisms) your applications do not have to change ("logical independence"). Here the applications are your users' and your trackers'.
Ultimately you must instrument and guesstimate. When it is worth it, you start caching: preferably as much as possible in terms of high-level notions like views and snapshots, and as little as possible in non-DBMS code. One of the benefits of the relational model is that it is easy to describe a straightforward relational interface in terms of another straightforward relational interface. You protect your applications from change by offering an interface that hides the secrets of the implementation, or of which of a family of interfaces is the current one.
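Applied to the action-count example from the question (table and column names assumed), the "calculate at query time" and "snapshot" options look roughly like this, in whichever database you are allowed to create objects in:

-- Calculate at query time: a plain view; always current, recomputed on every read.
CREATE VIEW user_action_counts AS
SELECT user_id, COUNT(*) AS action_count
FROM actions
GROUP BY user_id;

-- Calculate on a schedule: a materialized view (snapshot), refreshed whenever staleness is acceptable.
CREATE MATERIALIZED VIEW user_action_counts_snapshot AS
SELECT user_id, COUNT(*) AS action_count
FROM actions
GROUP BY user_id;

REFRESH MATERIALIZED VIEW user_action_counts_snapshot;  -- e.g. from a nightly or hourly job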

Is EAV-Hybrid a bad database design choice?

We have to redesign a legacy POI database from MySQL to PostgreSQL. Currently all entities have 80-120+ attributes that represent individual properties.
We have been asked to consider flexibility as well as a good design approach for the new database. However, the new design should allow:
any number of attributes/properties for any entity, i.e. the number of attributes for any entity is not fixed and may change on a regular basis.
content admins to add new properties to existing entities on the fly through admin interfaces, rather than making changes to the DB schema all the time.
There are quite a few discussions about performance issues with EAV, but if we don't go with a hybrid EAV we end up:
having a lot of empty columns (we still have to go and add new columns even if 99% of the data does not have those properties)
spending more time maintaining the database, especially when attributes keep changing.
having no way of allowing content admins to add new properties to existing entities.
Anyway here's what we are thinking about the new design (basic ERD included):
Have separate tables for each entity containing some basic info that is exclusive, e.g. id, name, address, contact, created, etc.
Have two tables, attribute type and attribute, to store property information.
Link each entity to an attribute using a many-to-many relation.
Store addresses in a different table and link them to entities using a foreign key.
We think this will allow us to be more flexible when adding, removing or updating properties.
This design, however, will result in an increased number of joins when fetching data; e.g. to display all "attributes" for a given stadium we might have a query with 20+ joins to fetch all related attributes in a single row.
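For reference, a rough SQL sketch of the hybrid design described above (all names illustrative), with the fixed, well-known columns on the entity table and only the open-ended properties in the EAV tables:

CREATE TABLE addresses (
    id      SERIAL PRIMARY KEY,
    line1   TEXT,
    city    TEXT,
    country TEXT
);

-- One table per entity type for the basic, exclusive info.
CREATE TABLE stadiums (
    id         SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    contact    TEXT,
    address_id INT REFERENCES addresses(id),
    created    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Attribute metadata.
CREATE TABLE attribute_types (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL                        -- e.g. 'text', 'integer', 'boolean'
);

CREATE TABLE attributes (
    id                SERIAL PRIMARY KEY,
    attribute_type_id INT NOT NULL REFERENCES attribute_types(id),
    name              TEXT NOT NULL UNIQUE    -- e.g. 'capacity', 'roof_type'
);

-- Many-to-many link between entities and attributes, carrying the value.
CREATE TABLE stadium_attributes (
    stadium_id   INT NOT NULL REFERENCES stadiums(id),
    attribute_id INT NOT NULL REFERENCES attributes(id),
    value        TEXT,
    PRIMARY KEY (stadium_id, attribute_id)
);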
What are your thoughts on this design, and what would be your advice to improve it.
Thank you for reading.
I'm maintaining a 10 year old system that has a central EAV model with 10M+ entities, 500M+ values and hundreds of attributes. Some design considerations from my experience:
If you have any business logic that applies to a specific attribute it's worth having that attribute as an explicit column. The EAV attributes should really be stuff that is generic, the application shouldn't distinguish attribute A from attribute B. If you find a literal reference to an EAV attribute in the code, odds are that it should be an explicit column.
Having significant amounts of empty columns isn't a big technical issue. It does need good coding and documentation practices to compartmentalize different concerns that end up in one table:
Have conventions and rules that let you know which part of your application reads and modifies which part of the data.
Use views to ease poking around the database with debugging tools.
Create and maintain test data generators so you can easily create schema conforming dummy data for the parts of the model that you are not currently interested in.
Use rigorous database versioning. The only way to make schema changes should be via a tool that keeps track of and applies change scripts. PostgreSQL has transactional DDL, which is a killer feature for automating schema changes.
PostgreSQL doesn't really like skinny tables. Each attribute value results in 32 bytes of data storage overhead, in addition to the extra work of traversing all the rows to pull the data together. If you mostly read and write the attributes as a batch, consider serializing the data into the row in some way. attr_ids int[], attr_values text[] is one option, hstore is another, or something client-side, like JSON or protobuf, if you don't need to touch anything specific on the database side.
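A hedged sketch of the serialization idea mentioned above, using hstore (table and attribute names are made up):

-- Store all EAV-style attributes for a row in a single hstore column.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE entities (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL,
    attrs hstore                        -- e.g. 'capacity=>60000, roof_type=>retractable'
);

-- Read a single attribute:
SELECT attrs -> 'capacity' FROM entities WHERE id = 42;

-- A GIN index keeps "which rows have this attribute/value" queries reasonable:
CREATE INDEX entities_attrs_idx ON entities USING gin (attrs);
SELECT id FROM entities WHERE attrs @> 'roof_type=>retractable'::hstore;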
Don't go out of your way to put everything into one single entity table. If the entities don't share any attributes in a sensible way, use multiple instantiations of whichever specific EAV pattern you use. But do try to use the same pattern and share any accessor code between the different instantiations. You can always parametrise the code on the entity name.
Always keep in mind that code is data and data is code. You need to find the correct balance between pushing decisions into the meta-model and expressing them as code. If you make the meta-model do too much, modifying it will need the same kind of understanding of the system, versioning tools, QA procedures and staging as your code does, but it will have none of the tooling. In essence you will be doing programming in a very awkward, non-standard language. On the other hand, if you leave too much in the code, every trivial change will need a new version of your software. People tend to err on the side of making the meta-model too complex. Building developer tools for meta-models is hard and tedious work with limited benefit. On the other hand, making the release process cheaper by automating everything that happens from commit to deploy has many side benefits.
EAV can be useful for some scenarios. But it is a little like "the dark side". Powerful, flexible and very seductive it is. But it's something of an easy way out. An easy way out of doing proper analysis and design.
I think "entity" is a bit over the top too general. You seem to have some idea of what should be connected to that entity, like address and contact. What if you decide to have "Books" in the model. Would they also have adresses and contacts? I think you should try to find the right generalizations and keep the EAV parts of the model to a minium. Whenever you find yourself wanting to show a certain subset of the attributes, or test for existance of the value, or determining behaviour based on the value you should really have it modelled as a columns.
You will not get a better opportunity to design this system than now. The requirements are known since the previous version, and also what worked and what didn't. (Just don't fall victim to the Second System Effect)
One good implementation of EAV can be found in Magento, a CMS for e-commerce. There is a lot of bad talk about EAV these days, but I challenge anyone to come up with a solution other than EAV for dealing with infinite product attributes.
Sure, you could go about enumerating all the columns you would need for every product in the world, but that would take you a lot of time and you would inevitably forget product attributes along the way.
So the bottom line is: use EAV for the open-ended stuff, but don't rely on EAV for all of the database's tables. Hence a hybrid of EAV and a relational DB, when done right, is a powerful tool that could not be accomplished by using fixed columns alone.
Basically EAV is trying to implement a database within a database, and it leads to madness. The queries to pull data become overly complex, and your data has no stable, specific model to keep it in some kind of order.
I've written EAV systems for limited applications, but as a generic solution it's usually a bad idea.

Users asking for denormalized database

I am in the early stages of developing a database-driven system and the largest part of the system revolves around an inheritance type of relationship. There is a parent entity with about 10 columns and there will be about 10 child entities inheriting from the parent. Each child entity will have about 10 columns. I thought it made sense to give the parent entity its own table and give each of the children their own tables - a table-per-subclass structure.
Today, my users requested to see the structure of the system I created. They balked at the idea of the table-per-subclass structure. They would prefer one big ~100 column table because it would be easier for them to perform their own custom queries.
Should I consider denormalizing the database for the sake of the users?
Absolutely not. You can always create a view later to show them what they want to see.
They are effectively asking for a report.
You could give them access to a view containing all the fields they require... that way you don't mess up your data model.
No. Structure the data properly and, if the users need a denormalized view of the data, create it as a VIEW in the database.
Alternatively, consider that perhaps an RDBMS is not the appropriate storage tool for this project.
They are the users and not the programmers of the system for a reason. Provide a separate interface for their queries. Power users like this can be both helpful and a pain to deal with. Just explain that you need the database designed a certain way so you can do your job, period. Once that is accomplished, you can provide other means to make querying easier.
What do they know!? You could argue that users shouldn't even be having direct access to a database in the first place.
Doing that leaves you open to massive performance issues, just because a couple of users are running ridiculous queries.
How about if you created a VIEW in the format your users wanted while still maintaining a properly normalized table?
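A sketch of what such a view might look like; the parent/child names here are invented purely for illustration:

-- Keep the table-per-subclass structure, but give users one wide, read-only surface.
CREATE VIEW entity_wide AS
SELECT p.id,
       p.name,
       p.created_at,                          -- ...the rest of the ~10 parent columns...
       a.engine_size   AS car_engine_size,
       a.door_count    AS car_door_count,     -- ...the rest of the car columns...
       b.cargo_volume  AS truck_cargo_volume  -- ...and so on for each subclass...
FROM vehicles p
LEFT JOIN cars   a ON a.vehicle_id = p.id
LEFT JOIN trucks b ON b.vehicle_id = p.id;

Columns from subclasses a row doesn't belong to simply come back NULL, which is exactly the shape of the single wide table the users asked for.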
Aside from a lot of the technical reasons for or against your users' proposition, you need to be on the same page in communicating the consequences of various scenarios and (more importantly) the costs of those consequences. If the users are your clients and they are paying you to do a job, explain that their awful "proposed" ideas may cost them more money in development time, additional hardware resources, etc.
Hopefully you can explain it in such a way that shows your expertise and why your idea is a much better value to your users in the long run.
As everyone more or less mentioned, that way lies madness, and you can always build a view.
If you just can't get them to come around on this point, consider showing them this thread and the number of pros who weighed in saying that the users are meddling with things that they don't fully understand, and the impact will be an undermined foundation.
A big part of the developer's craft is the feel for what won't work out long term, and the rules of normalization are almost canonical in that respect. There are situations where you need to denormalize (data warehouses, etc) but this doesn't sound like one of them!
It also sounds as though you may have a particularly troubling brand of user on your hands -- the amateur developer who thinks they could do your job better themselves if only they had the time. This may or may not help, but I've found that those types respond well to presentation -- a few times now I've found that if I dress sharp and show a little bit of force in my personality, it helps them feel like I'm an expert and prevents a bunch of problems before they start.
I would strongly recommend coming up with an answer that doesn't involve someone running direct reports against your database. The moment that happens, your DB structure is set in stone and you can basically consider it legacy.
A view is a good start, but later on you'll probably want to structure this as an export, to decouple further. Of course, then you'll encounter someone who wants "real time" data. Proper business analysis usually reveals this to be unnecessary. Actual real time requirements are not best handled through reporting systems.
Just to be clear: I'd personally favour the table per subclass approach, but I don't think it's actually as big an issue as the direct reporting off transaction tables is going to be.
I would opt first for a view (as others have suggested) or an inline table-valued function (the benefit of this is that you can require parameters - like a date range or a customer account - which can help to stop users from querying without any limits on the problem space). An inline TVF is really a parametrized view and is far closer to a view, in terms of how the engine treats it, than it is to a multi-statement table-valued function or a scalar function, which can perform incredibly poorly.
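A minimal sketch of the inline TVF option (SQL Server syntax; object names invented for illustration):

-- Inline table-valued function: effectively a view that demands parameters,
-- so users cannot query without bounding the problem space.
CREATE FUNCTION dbo.fn_OrdersForAccount
(
    @AccountId INT,
    @FromDate  DATE,
    @ToDate    DATE
)
RETURNS TABLE
AS
RETURN
(
    SELECT o.OrderId, o.OrderDate, o.Total
    FROM dbo.Orders AS o
    WHERE o.AccountId = @AccountId
      AND o.OrderDate >= @FromDate
      AND o.OrderDate <  @ToDate
);

-- Users then query it like a parametrized view:
-- SELECT * FROM dbo.fn_OrdersForAccount(42, '20240101', '20240201');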
However, in some cases, this can impact production performance if the view is complex or intensive. With poorly written ad hoc user queries, it can also cause locks to persist longer or be escalated further than they would on a better built query. It is also possible for users to misinterpret an E-R data model and produce multiplied numbers in cases where there are many-to-one or many-to-many relationships. The next option might be to materialize these views with indexes or make tables and keep them updated, which gets us closer to my next option...
So, given those drawbacks of the view option and already thinking of mitigating it by starting to make copies of data, the next option I would consider is to have a separate read-only (for these users) version of the data which is structured differently. Typically, I would first look at a Kimball-style star schema. You do not need to have a full-fledged time-consistent data warehouse. Of course, that's an option, but you could simply keep a reporting model up to date with data. Star-schemas are a special form of denormalization and are particularly good for numerical reporting, and a given star should not be able to be abused by users accidentally. You can keep the star up to date in a number of ways, including triggers, scheduled jobs, etc. They can be very fast for reporting needs and run on the same production installation - perhaps on a separate instance if not just a separate database.
Although such a solution may require you to effectively more than double your storage requirements, compared with other practices it might be a really good option if you understand your data well and don't mind having two models - one for transactions and one for analysis (note that you will already start to have this logical separation anyway with the simplest first option, a view).
Some architects will often double their servers and use the SAME model with some kind of replication, in order to provide a reporting server which is indexed more heavily or differently. Such a second server doesn't impact production transactions with reporting requirements and can be kept up to date fairly easily. There will only be one model, but of course this has the same usability problems of allowing users ad hoc access to the underlying model, only without the performance effects, since they get their own playground.
There are a lot of ways to skin these cats. Good luck.
The customer is always right. However, the customer is likely to back down when you convert their requirement into dollars and cents. A 100 column table will require extra dev time to write the code that does what the database would do automatically with the proper implementation. Further, their support costs will be higher since more code means more problems and lower ease of debugging.
I'm going to play devil's advocate here and say that both solutions sound like poor approximations of the actual data. There's a reason that object-oriented programming languages don't tend to be implemented with either of these data models, and it's not because Codd's 1970 ideas about relations were the ideal system for storing and querying object-oriented data structures. :-)
Remember that SQL was originally designed as a user interface language (that's why it looks vaguely like English and not at all like other languages of that era: Algol, C, APL, Prolog). The only reasons I've heard for not exposing a SQL database to users today are security (they could take down the server!) and usability (who wants to write SQL when you can clicky clicky?), but if it's their server and they want to, then why not let them?
Given that "the largest part of the system revolves around an inheritance type of relationship", then I'd seriously consider a database that lets me represent that natively, either Postgres (if SQL is important) or a native object database (which are awesome to work with, if you don't need SQL compatibility).
Finally, remember that every engineering decision is a tradeoff. By "sticking to your guns" (as somebody else proposed), you're implicitly saying the value of your users' desires are zero. Don't ask SO for a correct answer to this, because we don't know what your users want to do with your data (or even what your data is, or who your users are). Go tell them why you want a many-tables solution, and then work out a solution with them that's acceptable to both of you.
You've implemented Class Table Inheritance and they're asking for Single Table Inheritance. Both designs are valid in certain situations.
You might want to get a copy of Martin Fowler's Patterns of Enterprise Application Architecture to read more about the advantages and disadvantages of each design. That book is a classic reference to have on your bookshelf, in any case.
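For anyone not familiar with the two patterns being named here, a minimal sketch of each (columns invented for illustration):

-- Class Table Inheritance: shared columns in a parent table, each subclass in its own table.
CREATE TABLE products (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL,
    price NUMERIC NOT NULL
);

CREATE TABLE books (
    product_id INT PRIMARY KEY REFERENCES products(id),
    author     TEXT,
    page_count INT
);

-- Single Table Inheritance: every column of every subclass in one wide table,
-- with a discriminator column; columns that don't apply to a row are simply NULL.
CREATE TABLE products_single (
    id              SERIAL PRIMARY KEY,
    product_type    TEXT NOT NULL,      -- 'book', 'dvd', ...
    name            TEXT NOT NULL,
    price           NUMERIC NOT NULL,
    author          TEXT,               -- book-only
    page_count      INT,                -- book-only
    runtime_minutes INT                 -- dvd-only
);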
