How do you reconcile the DRY Principle with database efficiency?

I will use Python in my code snippet and am referencing the Django documentation, but this question is intended to be language-agnostic.
From the Django tutorial:
A model is the single, definitive source of data about your data. It contains the essential fields and behaviors of the data you’re storing. Django follows the DRY Principle. The goal is to define your data model in one place and automatically derive things from it.
It goes on to say
It should execute SQL statements as few times as possible, and it should optimize statements internally.
How do we reconcile these two design philosophies? To explore the situation, let us assume that we have a table representing profiles. We also have a function get_profile(profile_id) in our data model which does the following:
retrieves the row from the database
performs some computation to generate additional properties
returns this processed record as the Canonical Profile Representation
Next, we find we also need a get_profiles(list_of_profile_ids) function. The DRY principle, as I understand it, would mandate implementation along the lines of
def get_profiles(list_of_profile_ids):
    profiles = []
    for profile_id in list_of_profile_ids:
        profiles.append(get_profile(profile_id))
    return profiles
If our list of profile_ids is long, we are now performing many individual database queries. Is this not horribly inefficient compared to a function that performs a single database query (WHERE profile_id IN (id1, id2, id3)) and then performs the same computations on each row?
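For comparison, a batched variant might look like the following sketch, where a hypothetical fetch_profile_rows() issues a single WHERE profile_id IN (...) query and a hypothetical process_row() holds the shared computation, so the canonical-representation logic still lives in exactly one place:

def get_profiles(list_of_profile_ids):
    # One round trip: fetch all requested rows at once.
    rows = fetch_profile_rows(list_of_profile_ids)
    # The shared post-processing stays in one place (DRY).
    return [process_row(row) for row in rows]

def get_profile(profile_id):
    # The single-record accessor reuses the same pieces.
    return get_profiles([profile_id])[0]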
How does one reconcile DRY with keeping execution efficient?


System design: whether to normalize the departments or not

I'm working with two consultants on one project. We have reached a point where they cannot agree and each offers a different approach.
We have a store with four departments and we want to find the best approach for working with all of them in the same database.
Each department sells different products: Cars, Boats, Jet Skis and Motorbikes.
When data is inserted or updated in each department, triggers fire so that different workflows begin; when adding a new car there are certain requirements that need to be checked, as well as details of the car that are completely different from a boat's. Regarding the data, there are not many fields in common, I would say so far only the brand, color, model and year; everything else is specific to each department due to the different products and how each department works with them.
Consultant one says:
Create one table for all the departments and use a column to identify which department the row belongs to. This way you will have only one trigger, and inside the trigger you call the function/method you need for each record type.
Reason: you only have one table (with over 200 fields) and one trigger, which is easier to maintain. Also, if you need to report, you just query one table and filter on the record type. If you need to report across all the items, you don't need multiple joins.
Consultant two says:
Create one table for each department and a trigger for each table.
Reason: you will have smaller tables (approx. 50 fields each), it is more flexible, and everything is kept separate. If you want a report that includes data from different places, you will need to join the tables.
I see the advantages of having everything in one place, but if I want to expand or change anything, I have the feeling I will be creating a beast of a table as the data grows.
On the other hand, keeping them separate looks more appealing, but I will need to set up everything for each table.
What would you say is the best approach?
You should probably listen to consultant number two.
The thing is, all design is trade-offs. You need to assess the pros and cons of each approach and you need to think about the risks that each design entails.
What happens when your design grows? (department 5, more details per product type,...)
What happens when the system scales up to higher transaction volumes?
What happens when your business rules change?
I've been doing this for a long time and I've seen some pendulums swing back and forth when it comes to what is "in fashion" as far as database and software best practices.
I'd say right now the prevailing wisdom is that separation of concerns is innately good. This means you should keep your program logic (trigger code) separate for each department. This makes sense because your logic will vary from one product type to the next since they mostly have distinct columns.
This second point is also important, because your stake in the ground for a transactional system should always be to start with third normal form (or higher, if necessary). Sometimes you can get away without it, but four different types of objects with 40 or more distinct attributes each doesn't sound like a good candidate for jamming everything into one table. How do you keep track of which columns belong to which type of product, for example? A separate table for each product type keeps this clean and simple - and importantly - easy for your support programmers to understand.
Contrary to what consultant one is saying, having one trigger instead of four is not likely to be easier to maintain, whether that one trigger is a big bowl of spaghetti or four tidy, well-written subroutines joined together by a switch-type statement.
These days, programmers favour short, atomic, single-purpose functions (triggers, in your case).
If there is enough common data and common business logic that doing it four times seems awkward, then maybe you have a good candidate for a super-type / sub-type design.
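For illustration only, a super-type / sub-type layout might look roughly like this (SQLite via Python, with made-up department-specific columns):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Super-type: the handful of attributes every department shares.
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    department TEXT NOT NULL,   -- 'car', 'boat', 'jet_ski', 'motorbike'
    brand      TEXT,
    color      TEXT,
    model      TEXT,
    year       INTEGER
);

-- Sub-types: one table per department for its specific attributes.
CREATE TABLE car (
    product_id     INTEGER PRIMARY KEY REFERENCES product(product_id),
    engine_size_cc INTEGER,
    door_count     INTEGER
    -- ... the other car-only fields
);

CREATE TABLE boat (
    product_id    INTEGER PRIMARY KEY REFERENCES product(product_id),
    hull_type     TEXT,
    length_metres REAL
    -- ... the other boat-only fields
);
""")

Common reporting joins product alone; department-specific logic and triggers attach to the sub-type tables.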
I'll say option one.
These are all Products; it doesn't matter whether it's a Bike or a Car. You can control the fields and the object with Record Types and Page Layouts, and that will save you from having 4 objects, which means potentially 8 new classes (if it follows my pattern, it could be up to 20+) plus all of the workflow rules and validation rules across these new objects. It will be very hard to maintain a structure that has 4 objects which are all the same thing: tracking Products.
Down the road, if you decide to add a new product such as planes, it will be very easy to add a plane to this object, and the code will be able to pick it up from there if needed. You will definitely need Record Types to manage each Product. The trigger code shouldn't be an issue if the consultants are building it properly, meaning a trigger should never contain any business logic; as long as that is followed, all of the code will be maintainable.
I will go with option one.
I assume you have a large number of products and that this list will grow in the future. These are all Products in the end. They will have some common fields and common logic.
If you use Process Builder with Invocable classes instead of triggers, you may be able to get away with just configuration changes when adding a new object, if its fields and functionality are the same as or similar to an existing object's.
There may also be limitations on the number of different objects a profile has access to, based on your license types.
Salesforce has a standard object called Product. It's a single object that is classified based on record type.
I would have gone with approach two if this were not Salesforce. Based on how Salesforce works and the limitations it imposes, option one seems like a better and cleaner solution.
I would say option 2.
Why?
(1) I would find one table with 200+ columns harder to maintain. You're also then going to have to expose fields for an object that doesn't need said fields.
(2) You are also going to have to "hide" logic inside the trigger which then decides to do different actions based on the type of department etc...
(3) Option 2 involves more "scaffolding" and separate objects, but those objects are inherently smaller and easier to maintain, and they don't hide logic or cause any sort of ambiguity.
(4) Option 2 abides by the single responsibility principle. Not everyone follows this, I understand, but I find it a good guiding principle: the responsibility for the data lies with the individual table, and the responsibility for triggering the action lies with the individual trigger, as opposed to one mammoth entity/trigger.
** I would state that I am simply looking at this from a software development perspective. I am not sure whether or not Salesforce would handle this setup, but it is the way I would personally prefer to design it. :)
Option 2 for me.
You've said that there is little common data and the trigger logic is completely different. Here are some additional technical considerations.
Option 1 Warnings
The trigger would be a single point of failure and errors will be trickier to debug. I have worked with large triggers where broken logic near the top has stopped logic near the bottom from running, sometimes silently! You also have to maintain conditional guards to control the flow of logic based on the data which is another opportunity for error.
I'm not red hot on indexes, but I believe performance will suffer because the multi-purpose data has no natural order; more specific tables will yield better indexing strategies. Also, large rows can lead to fragmented indexes.
https://blogs.msdn.microsoft.com/pamitt/2010/12/23/notes-sql-server-index-fragmentation-types-and-solutions/
You would need extra consideration when setting nullable/default constraints on each surplus field not relevant to the product in question. These subtleties can introduce bugs and might make it harder if/when you decide to work with a data layer technology such as Entity Framework. E.g. the logical difference between NULL, 0 and 'None', especially on shared columns.

How to evaluate a condition on a large number of database records

I have a data model in Entity Framework Code First. This data model contains a Contracts entity of which there are about half a million records in SQL Server. A Contract entity is related to other entities, directly or indirectly.
I now have a backend job that needs to check all contracts for a condition and, if that condition evaluates to true, it must perform some action on the contract. The problem is that the condition is not simple enough to be put in a WHERE clause. Evaluating it for a Contract requires checking the state of a couple of objects in the Contract hierarchy. The condition evaluates to true for a very small fraction of the total number of Contracts in the database.
That means I only need to load a small number of Contracts into memory, but to determine which ones, I need to evaluate all the Contracts. So if I don't want to evaluate the condition in the database (for example, in a stored procedure), it seems that I need to load all the Contracts into memory.
So, it seems there are two very suboptimal solutions:
1) Determine the contract IDs of the contracts that meet the condition in a stored procedure, and then from code fetch only those contracts. This would mean putting logic in our database, which seems to go against the whole philosophy of Code First.
2) Fetch all the contracts into memory (part by part, for example in batches of 500) and evaluate the condition in code. Performance-wise, this is of course not very good.
My question is: what alternative ways are there to solve this problem?
For a given class ComplexClass, have a corresponding ComplexClassInfo. The Info class contains the key/important properties of ComplexClass, and it typically has members for child objects as well. We put in enough properties to give the Info classes some general usefulness; i.e., we don't have Info classes tailored to specific data queries.
Do an initial DB fetch, which may be filtered, for the ComplexClassInfo data. Then iterate over that ComplexClassInfo collection, applying the complex rule. Using the resulting set, query the DB to instantiate the individual ComplexClass objects.
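A rough Python rendering of that two-phase pattern (the original context is Entity Framework / C#; load_contract_infos and load_contract are hypothetical names):

def find_contracts_matching(condition, session):
    # Phase 1: fetch only the lightweight Info projections (keys plus whatever
    # fields the complex rule needs), optionally pre-filtered in SQL.
    infos = load_contract_infos(session)
    # Phase 2: apply the complex rule in code, then materialise only the hits.
    matching_ids = [info.contract_id for info in infos if condition(info)]
    return [load_contract(session, cid) for cid in matching_ids]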
My thoughts on the two approaches are as follows:
Yes, complex logic in a stored procedure is harder to maintain than code. But that does not mean you should never do it. If performance is important to you, that is what you should do.
How much time are you spending fetching and processing 500K rows? You should be able to optimise that. Consider the following (a sketch follows the list):
a. The number of columns you are fetching: can you optimise there?
b. What is the largest fetch size you can use? Can you set a fetch size of 500K?
c. What can you optimise in the code? Memory parameters? A faster algorithm?
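Putting a couple of those ideas together, a chunked fetch that pulls only the columns the condition needs might look like this sketch (illustrative table and column names, and a hypothetical complex_condition(); the original stack is Entity Framework, this is plain Python over a DB-API connection):

def contracts_needing_action(connection, chunk_size=5000):
    # Keyset pagination: walk the table in id order, one chunk per round trip,
    # fetching only the columns the condition actually needs.
    last_id = 0
    while True:
        rows = connection.execute(
            "SELECT id, status, total, renewal_dt FROM contracts "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            if complex_condition(row):   # the rule stays in code
                yield row[0]             # id of a contract that needs the action
        last_id = rows[-1][0]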

Duplication of data in a database versus application design

I have an application design question concerning handling data sets in certain situations.
Let's say I have an application where I use some entities. We have an Order, containing information about the client, the deadline, etc. Then we have a Service entity in a one-to-many relation with Order; a Service contains its name. Besides that, we have a Rule entity that sets some rules about what to deduct from the material stock; it is in a one-to-many relation with the Service entity.
Now, my question is: how do I handle the situation where I create an Order and persist it to the database with its relations, but I don't want later changes to the related entities to be visible in the generated order? I need to treat the Order and the data associated with it as a kind of log, so that removing a service from its table, or changing a set of rules, does not change already generated orders or the services and rules that were used during the process.
Normally I would handle this by duplicating the Services and Rules and inserting them into new tables, so that the data would be independent of the data used during Order generation. The Order would simply point to the duplicated data instead of the original, which would fix my problem. But that is data duplication, and I don't think it is the best way to do it.
So, if you understood my question, do you know a better way to solve this kind of problem? I'm sorry if what I wrote doesn't make sense; just tell me and I'll try to express myself better.
I've been looking into the same case recently, so I'd like to share some thoughts.
The idea is to treat each entity that requires versioning as an object and to store the object's instances in the database. For the service entity, this could be laid out like:
a service table, containing only a service_id column, the Primary Key;
a service_state (or ..._instance) table, containing:
service_id, a Foreign Key to service.service_id;
state_start_dt, the moment in time when this state becomes active, NOT NULL;
state_end_dt, the moment in time when this state is obsoleted, NULLable;
all the real attributes of the service;
the Primary Key is service_id + state_start_dt.
The state_start_dt .. state_end_dt ranges cannot overlap; this should be constrained (a sketch follows).
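A minimal sketch of that layout (SQLite via Python; the name and price columns stand in for the service's real attributes):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service (
    service_id INTEGER PRIMARY KEY
);

CREATE TABLE service_state (
    service_id     INTEGER NOT NULL REFERENCES service(service_id),
    state_start_dt TEXT NOT NULL,   -- moment this state becomes active
    state_end_dt   TEXT,            -- NULL while the state is still open
    name           TEXT NOT NULL,   -- illustrative "real" attributes
    price          REAL,
    PRIMARY KEY (service_id, state_start_dt)
    -- non-overlap of the [state_start_dt, state_end_dt) ranges would be
    -- enforced with a trigger or an application-level check
);
""")

# Query the system as it was at a given point in time.
as_of = "2024-01-01"
rows = conn.execute("""
    SELECT s.service_id, st.name, st.price
    FROM service s
    JOIN service_state st ON st.service_id = s.service_id
    WHERE st.state_start_dt <= ?
      AND (st.state_end_dt IS NULL OR st.state_end_dt > ?)
""", (as_of, as_of)).fetchall()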
What's good about such an approach?
You have a full history of state transitions of your essential objects;
You can query system as it was at the given point in time;
Delivery of new configuration can be done in advance by inserting an appropriate record(s) with desired state_start_dt stamps;
Change auditing is integrated into the design (well, a couple of extra columns are required for complete tracing).
What's wrong?
1. There will be data duplication. To reduce it, make sure to split up the versioned tables: e.g., do not create a single table for all customer data; create separate ones for credentials, addresses, contacts, financial information, etc.
2. The real Primary Key is service.service_id, while the information is kept in the subordinate table service_state. This can lead to a situation where your service exists while somebody has (intentionally or by mistake) removed all of its service_state records.
3. It is difficult to decide at which point in time it is safe to move state records to an offline archive; as long as there are entities in the system that reference the service, one should check their effective dates before removing any state records.
4. Due to #3, one cannot just delete records from service_state. In fact, it is also wrong to rely on the state_end_dt column alone, for a service may have been active for a while and then suppressed, and querying the service at a moment when it was active should still indicate it as active. Therefore, a status column is required.
I think that, keeping these downsides in mind, the approach is quite nice.
Though I'd like to hear some comments from the Relational Model perspective, especially on the drawbacks of such a design.
I would recommend just duplicating the data in separate snapshot table(s). You could certainly use versioning schemes on the main table(s), but I would question how much additional complexity results from the effort to reduce duplicate data. I find that extra complexity in the data model leads to a system that is much harder to extend. I would consider duplicate data to be the lesser of two evils here.
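A minimal sketch of the snapshot idea (hypothetical order_service and order_rule snapshot tables mirroring the live service and rule tables; SQLite-style placeholders):

def snapshot_order_dependencies(conn, order_id, service_ids):
    # Copy the current Service and Rule rows into order-scoped snapshot tables,
    # so later edits to the live tables never affect this order's history.
    if not service_ids:
        return
    conn.executemany(
        "INSERT INTO order_service (order_id, service_id, name) "
        "SELECT ?, service_id, name FROM service WHERE service_id = ?",
        [(order_id, sid) for sid in service_ids],
    )
    placeholders = ",".join("?" * len(service_ids))
    conn.execute(
        "INSERT INTO order_rule (order_id, rule_id, service_id, deduction) "
        "SELECT ?, rule_id, service_id, deduction FROM rule "
        "WHERE service_id IN (" + placeholders + ")",
        [order_id, *service_ids],
    )

The Order rows then reference order_service/order_rule instead of the live tables.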

Bad practice to have IDs that are not defined in the database?

I am working on an application that someone else wrote, and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, let's say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It creates a heavy dependence on the code to manage and maintain references, fetch the necessary values, and so on. In a situation where you now need to create additional functionality, you would rely on copying the mapping around (or importing it, etc.), which is more likely to cause an issue.
It's similar to why DB constraints should live in the DB rather than in the program/application that accesses it: any maintenance work or new application needs to replicate all the behaviour and rules. Having things this way has similar side-effects to those I've mentioned in another answer.
Good reasons to have a lookup table:
Since DBs naturally support these kinds of relations, it is the obvious place to put them.
Without a lookup table, queries first need to be constructed in code to translate between the Type/SubType text and IDs, instead of having that translation as part of the WHERE/HAVING clause of the query that is actually executed.
Speed/performance: with the right indexes and table structures, you'd benefit here (and reduce the code complexity that manages the mapping).
You don't need to update your code to add a new Type or SubType, or to edit or delete them.
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code there may not be a strict TypeID-to-SubTypeID relation, and that logic was handled in code rather than in the DB. Again, this can be managed using 'flag' values or NULLs where possible. Those specific cases could be handled by designing the DB right and then working around the unique/odd situation in code, instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and the app to begin with (e.g., if the DB isn't going to be accessed by other clients), this is probably a minor concern, particularly if the set of SubTypeIDs is small and seldom changes.
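For example, with a SubType lookup table in place (illustrative schema), human-readable results come straight out of a join, with no configuration file involved:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sub_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE question (
    id          INTEGER PRIMARY KEY,
    text        TEXT NOT NULL,
    type_id     INTEGER,
    sub_type_id INTEGER REFERENCES sub_type(id)
);
""")

# Readable report directly from the database.
rows = conn.execute("""
    SELECT q.id, q.text, s.name AS sub_type
    FROM question q
    JOIN sub_type s ON s.id = q.sub_type_id
""").fetchall()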

Can you use Decision Tables in Relational Databases

I heard that decision tables in relational databases have been researched a lot in academia. I also know that business rules engines use decision tables and that many BPMSs use them as well.
I was wondering if people today use decision tables within their relational databases?
A decision table is a cluster of conditions and actions. A condition can be simple enough that you can represent it with a simple "match a column against this value" string. Or a condition could be hellishly complex. An action, similarly, could be as simple as "move this value to a column". Or the action could involve multiple parts or steps or -- well -- anything.
A CASE expression in a SELECT or WHERE clause is a decision table. This is the first example of a decision table "in" a relational database.
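For instance (an illustrative query against a hypothetical orders table, not from the original post):

rows = connection.execute("""
    SELECT order_id,
           CASE                  -- conditions on the left, resulting actions on the right
               WHEN total >= 1000 THEN 'manual_review'
               WHEN total >= 100  THEN 'standard'
               ELSE 'fast_track'
           END AS handling
    FROM orders
""").fetchall()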
You can have a "transformation" table with columns that have old-value and replacement-value. You can then write a small piece of code like the following.
def decision_table(aRow):
    # Look up the replacement for this row's old value in the transformation table.
    result = connection.execute(
        "SELECT replacement_value FROM transformation WHERE old_value = ?",
        (aRow['somecolumn'],),
    )
    replacement = result.fetchone()
    aRow['anotherColumn'] = replacement['replacement_value']  # assumes a name-based row factory
Each row of the decision table has a "match this old_value" and "move this replacement_value" kind of definition.
The "condition" parts of a decision table have to be evaluated somewhere. Your application is where this will happen. You will fetch the condition values from the database. You'll use those values in some function(s) to see if the rule is true.
The "action" parts of a decision table have to be executed somewhere; again, your application does stuff. You'll fetch action values from the database. You'll use those values to insert, update or delete other values.
Decision tables are used all the time; they've always been around in relational databases. Each table requires a highly customized data model. It also requires a unique condition function and action procedure.
It doesn't generalize well. If you want, you could store XML in the database and invoke some rules engine to interpret and execute the BPEL rules. In this case, the rules engine does the condition and action processing.
If you want, you could store Python (or Tcl or something else) in the database. Seriously. You'd write the conditions and actions in Python. You'd fetch it from the database and run the Python code fragment.
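A tiny sketch of that idea (purely illustrative; only sensible when the stored code fragments are trusted, since eval/exec run arbitrary code):

def run_stored_rules(connection, row):
    # Each rule row holds a Python expression (condition) and statement (action),
    # e.g. condition_src = "row['total'] > 100", action_src = "row['flag'] = True".
    for condition_src, action_src in connection.execute(
        "SELECT condition_src, action_src FROM rules"
    ):
        if eval(condition_src, {}, {"row": row}):
            exec(action_src, {}, {"row": row})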
Lots of choices. None of them "academic". Indeed, the basic condition-action stuff is done all the time.
Whether or not to put decision tables in a database depends on a number of other questions.
Will your conditions be calculated inside the RDBMS or elsewhere? If the data used for evaluating these conditions lives there, and a suitable method for evaluating them inside the RDBMS can be devised, it is probably a good idea. Maybe your actions also happen inside your database, which would make it even more attractive.
Your conditions, and even the execution of your actions, might live outside the RDBMS, but you could still keep the connections between combinations of conditions and actions on the inside, probably because most of your other data is there and all you have is a web server sitting on top of it.
I can think of two ways to model this, depending on how many conditions you have (and whether they are binary), and on the column capacity per table.
Let's say you have 6 binary conditions; that gives 2^6 = 64 possible combinations. Then you could have one column for every combination and one row for every action.
Or you could have 16 conditions, which means 2^16 = 65536 combinations, a ridiculous number of columns. Better, then, to have a column for each condition and a column for each action, and 65536 rows describing what to do in each possible situation. Each row represents a situation and what to do in it. The only data type you use is bool; you could also pack these bools into bitmasked integers.
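As a small illustration of the "column per condition, column per action" layout (made-up condition and action names, SQLite via Python):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One column per condition, one column per action, one row per situation.
CREATE TABLE decision (
    cond_overdue  INTEGER NOT NULL,   -- 0/1 condition flags
    cond_vip      INTEGER NOT NULL,
    cond_disputed INTEGER NOT NULL,
    act_remind    INTEGER NOT NULL,   -- 0/1 "perform this action" flags
    act_escalate  INTEGER NOT NULL,
    PRIMARY KEY (cond_overdue, cond_vip, cond_disputed)
);
INSERT INTO decision VALUES (1, 0, 0,  1, 0);
INSERT INTO decision VALUES (1, 1, 0,  1, 1);
INSERT INTO decision VALUES (1, 0, 1,  0, 1);
-- ... remaining combinations as needed
""")

def actions_for(overdue, vip, disputed):
    # Conditions are evaluated in application code; the table says what to do.
    row = conn.execute(
        "SELECT act_remind, act_escalate FROM decision "
        "WHERE cond_overdue=? AND cond_vip=? AND cond_disputed=?",
        (int(overdue), int(vip), int(disputed)),
    ).fetchone()
    return row if row else (0, 0)

print(actions_for(True, True, False))   # -> (1, 1)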
Actually, bigger decision tables are better avoided; divide and rule, and use more tables instead. A subject matter expert will usually get tired if asked to give opinions on too large a number of conditions.
The strength of the decision table is really in the modelling stage, where the developer and the subject matter expert can check that every possible situation is covered and no blind spots exist.
I would look into using an object database rather than a traditional RDBMS (relational database management system). Object databases are designed to be fast at handling hierarchical relationships between objects, whereas in an RDBMS you have to represent these relationships across multiple table rows, or even tables, so your queries (tree traversals) will be slow.
