Duplication of data in a database versus application design

Duplication of data in a database versus application design - database

I have an application design question concerning handling data sets in certain situations.
Let's say I have an application where I use some entities. We have an Order, containing information about the client, deadline, etc. Then we have Service entity having one to many relation with an Order. Service contains it's name. Besides that, we have a Rule entity, that sets some rules concerning what to deduct from the material stock. It has one to many relation with Service entity.
Now, my question is: How to handle situation, when I create an Order, and I persist it to the database, with it's relations, but at the same time, I don't want the changes made to entities that happen to be in a relation with the generated order visible. I need to treat the Order and the data associated with it as some kind of a log, so that removing a service from the table, or changing a set of rules, is not changing already generated orders, services, and rules that were used during the process.
Normally, how I would handle that, would be duplicating Services and Rules, and inserting it into new table, so that data would be independent from the one that is used during Order generation. Order would simply point to the duplicated data, instead of the original one, which would fix my problem. But that's data duplication, and as I think, it's not the best way to do it.
So, if you understood my question, do you know any better idea for solving that kind of a problem? I'm sorry if what I wrote doesn't make any sense. Just tell me, and I'll try to express myself in a better way.

I've been looking into the same case resently, so I'd like to share some thoughts.
The idea is to treat each entity, that requires versioning, as an object and store in the database object's instances. Say, for service entity this could be presented like:
service table, that contains only service_id column, PrimaryKey;
service_state (or ..._instance) table, that contains:
service_id, Foreign Key to the service.service_id;
state_start_dt, a moment in time when this state becomes active, NOT NULL;
state_end_dt, a moment in time when this state is obsoleted, NULLable;
all the real attributes of the service;
Primary Key is service_id + state_start_dt.
for sure, state_start_dt::state_end_dt ranges cannot overlap, should be constrained.
What's good in such approach?
You have a full history of state transitions of your essential objects;
You can query system as it was at the given point in time;
Delivery of new configuration can be done in advance by inserting an appropriate record(s) with desired state_start_dt stamps;
Change auditing is integrated into the design (well, a couple of extra columns are required for a comlpete tracing).
What's wrong?
There will be data duplication. To reduce it make sure to split up the instantiating relations. Like: do not create a single table for customer data, create a bunch of those for credentials, addresses, contacts, financial information, etc.
The real Primary Key is service.service_id, while information is kept in a subordinate table service_state. This can lead to situation, when your service exists, while somebody had (intentionally or by mistake) removed all service_state records.
It's difficult to decide at which point in time it is safe to remove state records into the offline archive, for as long as there are entities in the system that reference service, one should check their effective dates prior to removing any state records.
Due to #3, one cannot just delete records from the service_state. In fact, it is also wrong to rely on the state_end_dt column, for service may have been active for a while and then suppressed. And querying service during moment when it was active should indicate service as active. Therefore, status column is required.
I think, that keeping in mind this approach downsides, it is quite nice.
Though I'd like to hear some comments from the Relational Model perspective — especially on the drawbacks of such design.

I would recommend just duplicating the data in separate snapshot table(s). You could certainly use versioning schemes on the main table(s), but I would question how much additional complexity results in the effort to reduce duplicate data. I find that extra complexity in the data model results in a system that is much harder to extend. I would consider duplicate data to be the lesser of 2 evils here.

Related

System design: whether to normalize the departments or not

I'm working with two consultants in one project. The thing is we reached a point where both of them cannot get into an agreement and each offer a different approach.
The thing is we have a store with four departments and we want to find the best approach for working with all of them in the same database.
Each department sell different products: Cars, Boats, Jetskies and Motorbikes.
When the data is inserted or updated in each department there are some triggers to be fires so different workflows will begin, when adding a new car there are certain requirements that needs to be checked as well as the details of the car that are completely different than a boat. Also, regarding the data there are not many fields there are in common, I would say so far only the brand, color, model and year, everything else is specific for each deparment due to the different products and how they work with them..
Consultant one says:
Create one table for all the departments and use a column to identify what department the row belongs to, this way you will have only one trigger and inside the trigger you will then call the function/mehod you need for each record type.
Reason: you only have one table (with over 200 fields) and one trigger, is easier to maintain. Also if you need to report you just need to query one table and filter based on the record type. If you need to report for all the items you don't need to have multiple joins.
Consultant two says:
Create one table for each deparment and a trigger for each table.
Reason: you will have smaller tables (aprox 50 fields each) and is more flexible and you have it all separated. If you want to report you need to join the tables as you want to include data from different places.
I see the advantages of having everything in one place but if I want to expand or change anything I have the feeling I will bre creating a beast table as the data grows.
On the other side keep it separated look more appealing but will need to setup everything for each different table.
What would you say is the best approach?

You should probably listen to consultant number two.
The thing is, all design is trade-offs. You need to assess the pros and cons of each approach and you need to think about the risks that each design entails.
What happens when your design grows? (department 5, more details per product type,...)
What happens when the system scales up to higher transaction volumes?
What happens when your business rules change?
I've been doing this for a long time and I've seen some pendulums swing back and forth when it comes to what is "in fashion" as far as database and software best practices.
I'd say right now the prevailing wisdom is that separation of concerns is innately good. This means you should keep your program logic (trigger code) separate for each department. This makes sense because your logic will vary from one product type to the next since they mostly have distinct columns.
This second point is also important, because your stake in the ground for a transactional system should always be start with third normal form (or higher, if necessary). Sometimes you can get away without it, but four different types of objects with 40 or more distinct attributes each doesn't sound like a good candidate for jamming everything into one table. How do you keep track of which columns belong to which type of product, for example? A separate table for each product type keeps this clean and simple - and importantly - easy for your support programmers to understand.
Contrary to what consultant one is saying, having one trigger instead of four is not likely to be easier to maintain if that one trigger is a big bowl of spaghetti, or even four tidy, well written subroutines joined together with a switch type statement.
These days, programmers favour short, atomic, single-purpose functions (triggers, in your case).
If there is enough common data and common business logic that doing it four times seems awkward, then maybe you have a good candidate for a super-type / sub-type design.

I'll say one
These are all Products, It doesn't matter that its a Bike or a Car. You can control the fields and the object by RecordTypes and Page layouts and that will save you from having 4 Objects, which means potentially 8 new classes(if it follows my pattern it could be up to 20+) + all of the workflow rules and validation rules across the these new objects, it will be very hard to maintain a structure that has 4 objects but are all the same thing.. Tracking Products.
Down the road if you decide to add a new product such as planes, it will be very easy to add a plane to this object and the code will be able to pick up from there if needed. You will definitely need Record Types to manage each Product. The trigger code shouldn't be an issue if the consultants are building it properly meaning a trigger should never have any business logic so as long as that is followed all of the code will be maintainable

I will go with one.
I assume you have a large number of products and this list will grow in future. All these are Products at the end. They will have some common fields and common logic.
If you use Process Builder with Invocable classes instead of Triggers, you may be able to get away with just configuration changes while adding a new object, if its fields and functionality are same/similar to a existing object.
There may also be limitation on the number of different objects a profile has access to based on your license types.
Salesforce has a standard object called Product. Its a single object to be classifies based on record type.
I would have gone with approach two if this was not salesforce. Based on how salesforce works and the limitations it imposes one seems like a better and cleaner solution.

I would say option 2.
Why?
(1) I would find one table with 200+ columns harder to maintain. You're also then going to have to expose fields for an object that doesn't need said fields.
(2) You are also going to have to "hide" logic inside the trigger which then decides to do different actions based on the type of department etc...
(3) Option 2 involves more "scaffolding" and separate objects but those are objects are inherently smaller and easier to maintain and don't specifically hide logic or cause any sort of ambiguity.
(4) Option 2 abides by the single responsibility principle. Not everyone follows this I understand but I find it a good guiding principle, as the responsibility for the data lies with the individual table and the responsibility for triggered the action lies with the individual trigger as opposed to just being one mammoth entity/trigger.
** I would state that I am simply looking at this from a software development perspective, I am not sure whether or not SalesForce would handle this setup, but it is the way I would personally prefer to design it. :)

Option 2 for me.
You've said that there is little common data and the trigger logic is completely different. Here are some additional technical considerations.
Option 1 Warnings
The trigger would be a single point of failure and errors will be trickier to debug. I have worked with large triggers where broken logic near the top has stopped logic near the bottom from running, sometimes silently! You also have to maintain conditional guards to control the flow of logic based on the data which is another opportunity for error.
I'm not red hot on indexes but I believe performance will suffer due to no natural order of the multi-purpose data. More specific tables will yield better indexing strategies. Also, large rows can lead to fragmented indexes.
https://blogs.msdn.microsoft.com/pamitt/2010/12/23/notes-sql-server-index-fragmentation-types-and-solutions/
You would need extra consideration when setting nullable/default constraints on each surplus field not relevant to the product in question. These subtleties can introduce bugs and might make it harder if/when you decide to work with a data layer technology such as Entity Framework. E.g. the logical difference between NULL, 0 and 'None', especially on shared columns.

Bad practice to have IDs that are not defined in the database?

I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, lets say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?

Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing them, etc.) which is more likely to cause an issue.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it - any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has similar side-affects I've mentioned here in another answer.
Good reasons to have a lookup table:
Since DBs can generally naturally have these kinds of relations, it would be obvious to use them.
Queries first need to be constructed in code for the Type- and SubType- Text vs ID instead of having them as part of the where/having clause of the query that is actually executed.
Speed/Performance - with the right indexes and table structures, you'd benefit from this (and reduce code complexity that manages it)
You don't need to update your code for to add a new Type or SubType, or to edit/delete them.
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation and that logic was handled in code rather than in the DB. Again, can be managed using 'flag' values or NULLs if possible. Those specific cases could be handled by designing the DB right and then working around a unique/odd situation in code instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs technical challenge: The original designer may not have had a proper understanding of databases and may have been a programmer who did that application (or was made to do it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.

As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (eg, if the DB isn't going to be accessed by other clients) this is probably a minor concern particularly if the set of SubTypeIDs is small and seldom changes.

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1)Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType). Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?

A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective daes, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?

you might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with and I would try and see how ot scales before going with somethin more complicated.

Alternatives to Entity-Attribute-Value (EAV)?

Our database is designed based on EAV (Entity-Attribute-Value) model. Those who have worked with EAV models know all the crap that comes with for the purpose of flexibility.
I asked my client about the reasons why using EAV model (flexibility), and their response was: Their entities change over time. So, today they may have a table with a few attributes, but in a month time, a few new attributes may be added, or an existing attribute may be renamed. They need to produce reports to get back to any stage in time and query the data based on the shape of entities at that stage.
I understand this is not feasible with a conventional relational model, but I personally see EAV as anti-pattern. Are there any other alternative models that enables us to capture the time dimension in changes to the entities and instances?
Cheers,
Mosh

There is a difference between EAV done faithfully or badly; 5NF done by skilled people or by those who are clueless.
Sixth Normal Form is the Irreducible Normal Form (no further Normalisation is possible). It eliminates many of the problems that are common, such as The Null Problem, and provides the ultimate method identifying missing values. It is the academically and technically robust NF. There are no products to support it, and it is not commonly used. To be implemented properly and consistently, it requires a catalogue for metadata to be implemented. Of course, the SQL required to navigate it becomes even more cumbersome (SQL already being cumbersome re joins), but this is easily overcome by automating the production of SQL from the metadata.
EAV is a partial set or a subset of 6NF. The problem is, usually it is done for a purpose (to allow columns to be added without having to make DDL changes), and by people who are not aware of the 6NF, and who do not implement metadata. The point is, 6NF and EAV as principles and concepts offer substantial benefits, and performance increases; but commonly it is not implemented properly, and the benefits are not realised. Quite a few EAV implementations are disasters, not because EAV is bad, but because the implementation is poor.
Eg. Some people think that the SQL required to construct the 3NF rows from the 6NF/EAV database is complex: no, it is cumbersome but not complex. More important, an ordinary SQL VIEW can be provided, so that all users and report tools see only the straight 3NF VIEW, and the 6NF/EAV issues are transparent to them. Last, the SQL required can be automated, so the labour cost that many people endure is quite unnecessary.
So the answer really is, Sixth Normal Form, being the father of EAV, and a purer form, is the replacement for it. The Caveat is, ensure it is done properly. I have one large 6NF db, and it suffers none of the problems people post about, it performs beautifully, the customer is very happy (no further work is a sign of complete functional satisfaction).
I have already posted a very detailed answer to another question which applies to your question as well, which you may be interested in.
Other EAV Question

Regardless of the kind of relational model you use, tracking field name changes requires a lot of meta data which you must keep track of in either transaction logs or audit tables. Unfortunately, querying either of those for state at a particular date is very complicated. If your client only requires state at a particular time date however, meaning the entire state, not just with respect to name changes, you can duplicate the database and roll back the transaction log to the particular time required and run your queries on the new instance. If entities added after the specified date need to show up in the query with the old field names however, you have a very large engineering problem ahead of you. In that case, with the information you provided in your question, I would suggest either negotiating alternatives with the client or getting more information about the use of the reports to find alternative solutions.
You could move to a document based datastore, but that still wouldn't solve the problem in the second case. Sorry this isn't really an answer, but having worked through similar situations, the client likely needs a more realistic reporting solution or a number of other investors willing to front the capital for the engineering.
When this problem came up for us, we kept the db schema constant and implemented an entity mapping factory based on a timestamp. In the end, the client continually changed requirements (on a weekly to monthly basis) as to how aggregate fields were calculated and were never fully satisfied.

To add to the answers from #NickLarsen and #PerformanceDBA
If you need to track historical changes to things like field name, you may want to look into something like Slowly Changing Dimensions. It appears to me like you are using the EAV to model dynamic dimensional models (probably lookup lists).
The simplest (and probably least efficient) way of achieving this would be to include an "as of" date field on EAV tables, and whenever a change occurs, insert a new record (instead of updating an existing record) with the current date. This means that you need to alter your queries to always include or look for an "as of" date, or deafult to "now" if none provided. Your base entity that joins to the EAV objects would then have to query "top 1" from the EAV table where "as of" date is less than or equal to the 'last updated' date of the row, ordered by "as of" descending. Worst case scenario, if you need to track the most recent change to a given row where both the name (stored in the 'attribute' table) and the value have changed, you would chain this logic to the value table using 'last modified' of the row to find the appropriate value for that particular date.
This obviously has the potential to generate LARGE amounts of data if there are a lot of changes. That's why this approach is referred to as "slowly" changing. It's intended for dimensional values that may change, but not very often. To help with query performance, indexes on the "as of" and "last modified" fields should help.

If your client needs such flexibility, then a relational database might not be the right match.
Consider MongoDB where JSON structures are stored. You can add or not add fields without limitations. You can even use nesting.

Create a new table description for each Entity description Version
and one additional table that tells you which table is which version.
The query system should be updated as well.
I think creating a script that generates, tables and queries is your best shot.

Adding relations to an Access Database

I have an MS Access database with plenty of data. It's used by an application me and my team are developing. However, we've never added any foreign keys to this database because we could control relations from the code itself. Never had any problems with this, probably never will either.
However, as development has developed further, I fear there's a risk of losing sight over all the relationships between the 30+ tables, even though we use well-normalized data. So it would be a good idea go get at least the relations between the tables documented.
Altova has created DatabaseSpy which can show the structure of a database but without the relations, there isn't much to display. I could still use to add relations to it all but I don't want to modify the database itself.
Is there any software that can analyse a database by it's structures and data and then do a best-guess about its relations? (Just as documentation, not to modify the database.)
This application was created more than 10 years ago and has over 3000 paying customers who all use it. It's actually document-based, using an XML document for it's internal storage. The database is just used as storage and a single import/export routine converts it back and to XML. Unfortunately, the XML structure isn't very practical to use for documentation and there's a second layer around this XML document to expose it as an object model. This object model is far from perfect too, but that's what 10 years of development can do to an application. We do want to improve it but this takes time and we can't disappoint the current users by delaying new updates.Basically, we're stuck with its current design and to improve it, we need to make sure things are well-documented. That's what I'm working on now.

Only 30+ tables? Shouldn't take but a half hour or an hour to create all the relationships required. Which I'd urge you to do. Yes, I know that you state your code checks for those. But what if you've missed some? What if there are indeed orphaned records? How are you going to know? Or do you have bullet proof routines which go through all your tables looking for all these problems?
Use a largish 23" LCD monitor and have at it.

If your database does not have relationships defined somewhere other than code, there is no real way to guess how tables relate to each other.
Worse, you can't know the type of relationship and whether cascading of update and deletion should occur or not.
Having said that, if you followed some strict rules for naming your foreign key fields, then it could be possible to reconstruct the structure of the relationships.
For instance, I use a scheme like this one:
Table Product
- Field ID /* The Unique ID for a Product */
- Field Designation
- Field Cost
Table Order
- Field ID /* the unique ID for an Order */
- Field ProductID
- Field Quantity
The relationship is easy to detect when looking at the Order: Order.ProductID is related to Product.ID and this can easily be ascertain from code, going through each field.
If you have a similar scheme, then how much you can get out of it depends on how well you follow your own convention, but it could go to 100% accuracy although you're probably have some exceptions (that you can build-in your code or, better, look-up somewhere).
The other solution is if each of your table's unique ID is following a different numbering scheme.
Say your Order.ID is in fact following a scheme like OR001, OR002, etc and Product.ID follows PD001, PD002, etc.
In that case, going through all fields in all tables, you can search for FK records that match each PK.
If you're following a sane convention for naming your fields and tables, then you can probably automate the discovery of the relations between them, store that in a table and manually go through to make corrections.
Once you're done, use that result table to actually build the relationships from code using the Database.CreateRelation() method (look up the Access documentation, there is sample code for it).

You can build a small piece of VBA code, divided in 2 parts:
Step 1 implements the database relations with the database.createrelation method
Step 2 deleted all created relations with the database.delete command
As Tony said, 30 tables are not that much, and the script should be easy to set. Once this set, stop the process after step 1, run the access documenter (tools\analyse\documenter) to get your documentation ready, launch step 2. Your database will then be unchanged and your documentation ready.
I advise you to keep this code and run it regularly against your database to check that your relational model sticks to the data.

There might be a tool out there that might be able to "guess" the relations but I doubt it. Frankly I am scared of databases without proper foreign keys in particular and multi user apps that uses Access as a DBMS as well.
I guess that the app must be some sort of internal tool, otherwise I would suggest that you move to a proper DBMS ( SQL Express is for free) and adds the foreign keys.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight