Database design: columns with duplicate meaning - database

Maybe the title of the question doesn't describe my problem very well, but here it is:
Let's say I have a table article that looks like this:
+ title
+ author
.
.
.
+ status [choices : ('draft', ..., 'translate')]
And let's say that in my business process I publish on my web page the articles that have [status = 'translate'].
Is it a good design decision to add another field:
+ read [bool]
to my table, meaning that the article is ready to be published, or is it a bad design because I can test status == 'translate' for that and the new field would just be a duplicate?
I hope that my question is clear. Thanks in advance.

Here's a fundamental DB design concept (it's actually part of making your table comply with 3NF): none of your columns should depend on anything but the primary key of the table. That should answer your question.
Here's a good quote to remember that:
every non-key attribute "must provide a fact about the key, the whole key, and nothing but the key, so help me Codd"
(that's also from the wiki)
The reason for that is that breaking this rule might compromise data integrity.

Bad design.
First, what you have here is a field that is basically the current state of a state machine.
Second, status should be a separate table - do NOT put status texts in the same table. You can then add additional info for every possible status to the status table.
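For instance, a minimal sketch of that separate status table, assuming roughly standard SQL (table and column names are made up, not from the question; BOOLEAN may need to be BIT or TINYINT depending on your DBMS):
-- Status lookup table: one row per possible status, plus any per-status info.
CREATE TABLE article_status (
    status_id INT PRIMARY KEY,
    name      VARCHAR(20) NOT NULL UNIQUE,    -- 'draft', ..., 'translate'
    is_ready  BOOLEAN NOT NULL DEFAULT FALSE  -- e.g. "ready to publish" lives here
);

-- The article table references the status instead of storing the text.
CREATE TABLE article (
    article_id INT PRIMARY KEY,
    title      VARCHAR(200) NOT NULL,
    author     VARCHAR(100) NOT NULL,
    status_id  INT NOT NULL REFERENCES article_status (status_id)
);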

Duplicate. If you can manage without a column, don't use it.
Think about the overhead it adds to your database (and besides, a boolean column makes for a poor index because of its low selectivity, so it won't improve your performance either).
(And of course, replace the status strings with numeric values).
Good luck.

In order for there not to be a potential data integrity conflict, you could make the "ready" column a computed column, or you could make a view which provides this translation service.
However, for this particular design, I would put the states into a table and have an IsReady column in the state table. Then you could add different states which are all IsReady. I have used designs like this many times, where certain states are equivalent for some operations, but not for others. Each has a flag. In my particular case, many batches in different states were allowed to be counted as "successful" for average timing/performance purposes, but batches which had completely successfully but were later invalidated were not considered "successful", etc.
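A rough sketch of the view variant, reusing the kind of status lookup table sketched above (names are illustrative; boolean literal syntax varies, e.g. SQL Server would compare a BIT against 1):
-- "Ready" is derived from the status, so there is nothing to keep in sync.
CREATE VIEW publishable_article AS
SELECT a.*
FROM   article a
JOIN   article_status s ON s.status_id = a.status_id
WHERE  s.is_ready = TRUE;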

This case has a name in normalization theory. It's called "harmful redundancy". Here are some potential drawbacks to harmful redundancy: the database can contradict itself; too much space is wasted; too much time is wasted.
Contradictions in the database get the most air time in tutorials on database design. You have to take some measures to prevent this situation from arising, or live with the consequences. You can rely on careful programming to keep contradictions out of the database, or you can declare constraints that will prevent any transaction from leaving the database in a contradictory state.
The waste of space is usually, but not always, a trivial cost. Wasted space can result in wasted time as a consequence.
The waste of time is the one that preoccupies programmers the most. But here, the issue gets to be more subtle. Sometimes "harmful redundancy" results in saving time, not wasting it. Most often, it results in extra time during updates, but time saving during retrieval. Often, the time savings or wastage is trivial, and therefore so is the design decision, from the point of view of speed.
In your case, the speed consequences should be minimal. Here's a hint: how often do you update rows? How often do you read them? How important is speed in updating or reading? If the time you gain during reading carries more weight for you than the time you spend updating, then go for it.
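If you decide to keep the redundant flag anyway, the "declare constraints" option mentioned above could look roughly like this sketch (the flag is renamed is_ready because READ is a reserved word in most dialects, and CHECK support and boolean literals vary by DBMS):
-- Row-level CHECK that stops the flag and the status from contradicting each other.
ALTER TABLE article
  ADD CONSTRAINT chk_ready_matches_status
  CHECK ( (is_ready = TRUE  AND status = 'translate')
       OR (is_ready = FALSE AND status <> 'translate') );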

Related

Does storing aggregated data go against database normalization?

On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table) that works like a view, only it's driven by a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values.
2) In the software package itself I set information like that as a system variable. So in Apache you can set a system-wide variable and refresh it every 5 minutes. Then it's somewhat accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, so it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
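As a sketch of the same idea in plain SQL, using hypothetical question/answer tables (the upsert syntax shown is the PostgreSQL flavour; elsewhere you would use MERGE or an equivalent):
-- Summary table holding the pre-computed counts.
CREATE TABLE question_stats (
    question_id  INT PRIMARY KEY,
    answer_count INT NOT NULL,
    refreshed_at TIMESTAMP NOT NULL
);

-- Refresh job, run by a scheduler every few minutes.
INSERT INTO question_stats (question_id, answer_count, refreshed_at)
SELECT q.question_id, COUNT(a.answer_id), CURRENT_TIMESTAMP
FROM   question q
LEFT JOIN answer a ON a.question_id = q.question_id
GROUP BY q.question_id
ON CONFLICT (question_id)
DO UPDATE SET answer_count = EXCLUDED.answer_count,
              refreshed_at = EXCLUDED.refreshed_at;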

2 Questions about Philosophy and Best Practices in Database Development

Which one is better for implementing the database behind a web application: a lean and very small database with only the bare information, paired with an application that "recalculates" all the secondary information on demand from that basic data, OR a database filled with all that secondary information already calculated, but possibly outdated?
Obviously there is a trade-off there, and I think anyone would say that the best answer to this question is "it depends" or "it's a mix between the two". But I'm really not comfortable or experienced enough to reason about this subject alone. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time, or should a database accumulate all the information from previous times, allowing you to retrace what happened? For instance, let's say I'm modeling a bank account. Should I only keep the account's balance on that day, or should I keep all of the account's transactions and infer the balance from those transactions?
Any pointers on this kind of stuff that goes somewhat deeper into database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDBMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing cost doesn't grow only with the size of the data: more data generally also means more users, so the total work roughly scales with both the data's size and the number of users:
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when the data grows to the point where processing time becomes significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out-of-date pre-calculated values. That's what triggers are for (among other things). However, for most applications I would not start precalculating until you need to; it may be that the calculation speed is always adequate. Now, in a banking application where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalculation process based on triggers that adjust the values every time the underlying data changes.
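A hedged sketch of such a trigger, using made-up account/transaction tables and SQL Server syntax (trigger syntax differs per DBMS, and a real version would also handle UPDATE and DELETE):
-- Keeps the stored balance in step with inserted transactions.
CREATE TRIGGER trg_adjust_balance
ON account_transaction
AFTER INSERT
AS
BEGIN
    UPDATE a
    SET    a.balance = a.balance + i.total
    FROM   account a
    JOIN  (SELECT account_id, SUM(amount) AS total
           FROM   inserted
           GROUP BY account_id) i ON i.account_id = a.account_id;
END;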
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, if you have an order, do not rely on the customer address table or the product table to get data about where the products were shipped to or what they cost at the time of the order. This data changes over time, and then your orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR applications are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometimes data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming (not calculating anything twice). If you're not limited on space and are fine with slightly outdated data, then precalculate it and store it in the DB. This will give you the additional benefit of being able to run sanity checks and ensure that the data is always consistent.
But as others already replied, it depends :)

database row/ record pointers

I don't know the correct words for what I'm trying to find out about, and as such I'm having a hard time googling.
I want to know whether it's possible with databases (technology-independent, but I would be interested to hear whether it's possible with Oracle, MySQL and Postgres) to point to specific rows instead of executing my query again.
So I might initially execute a query, find some rows of interest, and then wish to avoid searching for them again by keeping a list of pointers or some other metadata which indicates their location in the database, so I can go straight to them the next time I want those results.
I realise there is caching on databases, but I want to keep these "pointers" elsewhere, so caching doesn't ultimately solve this problem. Is this just an index, and do I store the index and look up by that? Most of my current tables don't have indexes and I don't want the speed decrease that sometimes comes with indexes.
So what's the magic term I've been trying to put into Google?
Cheers
In Oracle it is called ROWID. It identifies the file, the block number, and the row number in that block. I can't say that what you are describing is a good idea, but this might at least get you started looking in the right direction.
Check here for more info: http://www.orafaq.com/wiki/ROWID.
By the way, the "speed decrease that comes with indexes" that you are afraid of is only relevant if you do more inserts and updates than reads. Indexes only speed up reads, so if the read ratio is high, you might not have an issue and an index might be your best solution.
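For illustration only (Oracle syntax, hypothetical orders table), and with the caveat above that ROWIDs can change after table reorganisation, so storing them long-term is risky:
-- Capture the physical row pointers once...
SELECT ROWID AS row_pointer, o.*
FROM   orders o
WHERE  o.status = 'OPEN';

-- ...then fetch a row directly by its saved pointer later.
SELECT *
FROM   orders
WHERE  ROWID = :saved_rowid;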
"most of my current tables don't have indexes and I don't want the speed decrease that sometimes comes with indexes."
And you also don't want the speed increase which usually comes with indexes but you want to hand-roll a bespoke pseudo-cache instead?
I'm not being snarky here, this is a serious point. Database designers have expended a great deal of skill and energy on optimizing their products. Wouldn't it be more sensible to learn how to take advantage of their efforts rather than re-implementing some core features?
In general, the best way to handle this sort of requirement is to use the primary key (or in fact any convenient, compact unique identifier) as the 'pointer', and rely on the indexed lookup to be swift - which it usually will be.
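In other words, something as simple as this sketch (hypothetical orders table; bind-parameter syntax varies by client):
-- Remember the compact primary keys of the interesting rows...
SELECT order_id FROM orders WHERE status = 'OPEN';

-- ...and later fetch any of them straight through the primary key index.
SELECT * FROM orders WHERE order_id = :saved_id;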
You can use ROWID in more DBMSes than just Oracle, but it generally isn't recommended for a variety of reasons. If you succumb to the 'every table has an autoincrement column' school of database design, then you can record the autoincrement column values as the identifiers.
You should have at least one index on (almost) all of your tables - that index will be for the primary key. The exception might be for a table so small that it fits in memory easily and won't be updated and will be used enough not to be evicted from memory. Then an index might be a distraction; however, such tables are typically seldom updated so the index won't harm anything, and the optimizer will ignore it if the index doesn't help (and it may not).
You may also have auxiliary indexes. In a system where most of the activity is reading the data, you may want to err on the side of having more indexes rather than fewer, because access time is most critical. If your system were update-intensive, then you would go with fewer indexes, because there is a cost associated with updating indexes when data is added, removed or updated. Clearly, you need to design the indexes to work well with the queries that your users (or your applications) actually perform.
You may also be interested in cursors. (Note that the index debate is still valid with cursors.)
Wikipedia definition here.

BALD-D battle Against Bad Database Design

I'm no DBA, but I respect database theory. Isn't adding columns like isDeleted and sequenceOrder bad database practice?
That depends. Being able to soft-delete a tuple (i.e., mark it as deleted rather than actually deleting it) is essential if there's any need to later access that tuple (e.g., to count deleted things, or do some type of historical analysis). It also has the possible benefit, depending on how indexes are structured, of causing less of a disk traffic hit when soft-deleting a row (by having to touch fewer indexes). The downside is that the application takes on responsibility for managing foreign keys to soft-deleted things.
If soft deleting is done for performance, a periodic (e.g., nightly, weekly) task can clean soft-deleted tuples out during a low-traffic period.
Using an explicit 'sequence order' for some tuples is useful in several cases, esp. when it's not possible or wise to depend on some other field (e.g., ids, which app developers are trained not to trust) to order things that need to be ordered in some specific way for business reasons.
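A sketch of the soft-delete pattern plus the periodic clean-up described above (invented names, roughly standard SQL; the flag type and default syntax may need adjusting per DBMS, e.g. BIT in SQL Server):
ALTER TABLE customer ADD is_deleted BOOLEAN NOT NULL DEFAULT FALSE;

-- Normal queries go through a view that hides soft-deleted rows.
CREATE VIEW active_customer AS
SELECT * FROM customer WHERE is_deleted = FALSE;

-- "Delete" becomes an update...
UPDATE customer SET is_deleted = TRUE WHERE customer_id = :id;

-- ...and a low-traffic periodic job purges the flagged rows for real.
DELETE FROM customer WHERE is_deleted = TRUE;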
IsDeleted columns have two purposes.
To hide a record from users instead of deleting it, thus retaining the record in the database for later use.
To provide a two-stage delete process, where one user marks a record for deletion, and another user confirms.
Not sure what SequenceOrder is about. Do you have a specific application in mind?
Absolutely not. Each database has different requirements, and based on those requirements, you may need columns such as those.
An example for isDeleted could be if you want to allow the user interface to delete unneeded things, but retain them in the database for auditing or reporting purposes. Or if you have incredibly large datasets, deleting is a very slow operation and may not be possible to perform in real-time. In this case, you can mark it deleted, and run a batch clean-up periodically.
An example for sequenceOrder is to enable arbitrary sorting of database rows in the UI, without relying on intrinsic database order or sequential insertion. If you insert rows in order, you can usually get them back out in that order... until people start deleting rows and inserting new ones.
SequenceOrder doesn't sound great (although you've given no background at all), but I've used columns like IsDeleted for soft deletions all my career.
Since you explicitly state that you're interested in the theoretical perspective, here goes:
At the level of the LOGICAL design, it is almost by necessity a bad idea to have a boolean attribute in a table (btw, the theory's correct term for this is "relvar", not "table"). The reason is that having a boolean attribute makes it very awkward to define/document the meaning (relational theory names this the "predicate") that the relvar has in your system. If you include the boolean attribute, then the predicate defining such a relvar's meaning would have to include some construct like "... and it is -BOOLEANATTRIBUTENAME here- that this tuple has been deleted." That is an awkward circumlocution.
At the logical design level, you should have two distinct tables, one for the non-deleted rows, and one for the deleted-rows-that-someone-might-still-be-interested-in.
At the PHYSICAL design level, things may be different. If you have a lot of delete-and-undelete, or even just a lot of delete activity, then physically having two distinct tables is likely to be a bad idea. One table with a boolean attribute that acts as a "distinguishing key" between the two logical tables might indeed be better. If, on the other hand, you have a lot of query activity that only needs the non-deleted ones, and the volume of deleted ones is usually large in comparison to the non-deleted ones, it might be better to keep them apart physically too (and bite the bullet about the probably worse update performance you'll get, if that were even noticeable).
But you said you were interested in the theoretical perspective, and theory (well, as far as I know it) has actually very little to say about matters of physical design.
With regard to the sequenceOrder column, that really depends on the particular situation. I guess that most of the time you wouldn't need it, because ordering of items as required by the business is most likely to be on "meaningful" data. But I could imagine sequenceOrder columns being used to mimic insertion timestamps and the like.
Backing up what others have said, both can have their place.
In our CRM system I have an isDeleted-like field in our customer table so that we can hide customers we are no longer servicing while leaving all the information about them in the database. We can easily restore deleted customers and we can strictly enforce referential integrity. Otherwise, what happens when you delete a customer but do not want to delete all records of the work you have done for them? Do you leave references to the customer dangling?
SequenceOrder, again, is useful to allow user-defined ordering. I don't think I use it anywhere, but suppose you had to list say your five favorite foods in order. Five tasks to complete in the order they need to be completed. Etc.
Others have adequately tackled isDeleted.
Regarding sequenceOrder, business rules frequently require lists to be in an order that may not be determined by the actual data.
Consider a table of Priority statuses. You might have rows for High, Low, and Medium. Ordering by the description will give you either High, Low, Medium or Medium, Low, High.
Obviously that order does not convey the relationship that exists between the three records. Instead you would need a sequenceOrder field so that the order makes sense, so that you end up with [1] High, [2] Medium, [3] Low, or the reverse.
Not only does this help with human readability, but system processes can now give appropriate weight to each one.
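That example as a quick sketch (illustrative names only):
CREATE TABLE priority (
    priority_id    INT PRIMARY KEY,
    description    VARCHAR(20) NOT NULL,
    sequence_order INT NOT NULL
);

INSERT INTO priority VALUES (1, 'High', 1), (2, 'Medium', 2), (3, 'Low', 3);

-- The explicit sequence gives the business order that sorting by description cannot.
SELECT description FROM priority ORDER BY sequence_order;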

Am I right to sacrifice database design fundamentals in this case for the sake of speed?

I work at a company that uses a single-table Access database for its outbound CMS, which I moved to a SQL Server-based system. There's a data list table (not normalized) and a calls table. This gets about one update per second currently. All call outcomes, along with date, time, and agent id, are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists, sorted to give an even spread throughout their set). Note that a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is obviously not the speed at which a new record can be added to the calls table live, but that when the user app is closed/opened and loads the user's data set again, I need to check which records have not been called today. I would need to run a stored proc on the server that picked the most recent call from the calls table and checked whether its call date matched today's date - I believe a more expensive query than checking whether a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design; the main limitation is my inexperience. This is my first SQL Server system. It's pretty critical, and I had to ensure it would work and that I could easily dump data back to the Access DB during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working database may become the major flaw here. Rather, try to optimize what you have got running currently instead of starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now let's get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking normalisation.
The classic example is an Order table and an Order-Detail table, where the Order header table has "total price" and that value is derived from the Order-Detail and related tables. Having total price on Order in this case still makes sense, but it breaks normalisation.
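That classic example as a sketch (illustrative names; total_price on the header is the deliberate break with normalisation and has to be kept in step with the details, e.g. by a trigger):
CREATE TABLE order_header (
    order_id    INT PRIMARY KEY,
    total_price DECIMAL(12,2) NOT NULL   -- derivable from order_detail, stored for speed
);

CREATE TABLE order_detail (
    order_id   INT NOT NULL REFERENCES order_header (order_id),
    line_no    INT NOT NULL,
    quantity   INT NOT NULL,
    unit_price DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (order_id, line_no)
);

-- The value the stored total must always agree with:
SELECT order_id, SUM(quantity * unit_price) AS derived_total
FROM   order_detail
GROUP BY order_id;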
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure; you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
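As a rough sketch of the "not called today" check from the question, written against the calls table directly (table and column names are guesses at the schema, and bind-parameter syntax varies by DBMS):
-- Records in the agent's daily set that have no call logged today.
SELECT d.record_id
FROM   data_list d
WHERE  d.agent_id = :agent_id
  AND NOT EXISTS (
        SELECT 1
        FROM   calls c
        WHERE  c.record_id = d.record_id
          AND  c.call_date >= CAST(CURRENT_TIMESTAMP AS DATE)
      );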
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed, precisely because speed is determined only by physical design choices, in which the prime decision factor is precisely speed and performance.
