How to handle this DB schema?

Using SO as a prime example, let's say a question, or an answer to a question, is deleted but garnered a few up-votes before it was deleted. I imagine those points are still awarded to the author (if they aren't, let's suppose they are). How, then, does SO keep an accurate reputation total?
Are the questions/answers actually not deleted from the DB itself, and perhaps have a status field that is processed and decides whether a question or answer is visible?
Or are they, in fact, deleted, with the reputation relying on the system staying continuously accurate as each vote is counted, without necessarily keeping a history of it (like the question that recorded the vote)?

SO uses a combination of soft and hard deletes, to the best of my knowledge. I can say for sure that I've lost reputation that was gained on questions deleted by either the poster or the moderator. That is not the point of your question, however, so...
If you want to be able to deduce an accurate total, especially if you want to be able to account for that total (the way SO lets you do by looking at your points history) then you need to keep transactional information, not a running total.
If you want to have referential integrity for the transactional log of points then you will need to use a soft-delete mechanism to hide questions that are "deleted".
If you don't keep the transactional log and you don't have soft-deletable questions to back up your transactional points log, then you won't be able to either recalculate or justify point totals. You'll also have a much harder time displaying a graph of points awarded over time and accumulated reputation over time. You could do these graphs by keeping a daily point snapshot, but that would be much more onerous and costly in terms of storage than just tracking up and down votes.
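As a rough sketch of that approach, assuming a soft-deletable posts table and an append-only points ledger (these names are illustrative, not SO's actual schema):

CREATE TABLE posts (
    post_id    INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL,
    body       TEXT NOT NULL,
    is_deleted BOOLEAN NOT NULL DEFAULT FALSE   -- soft delete: hide, don't drop
);

CREATE TABLE reputation_events (
    event_id   INTEGER PRIMARY KEY,
    post_id    INTEGER NOT NULL REFERENCES posts (post_id),
    user_id    INTEGER NOT NULL,
    points     INTEGER NOT NULL,                -- e.g. +10 for an up-vote, -2 for a down-vote
    created_at TIMESTAMP NOT NULL
);

-- Reputation can always be recalculated and justified from the ledger,
-- excluding (or including) events on soft-deleted posts as policy dictates.
SELECT e.user_id, SUM(e.points) AS reputation
FROM reputation_events e
JOIN posts p ON p.post_id = e.post_id
WHERE p.is_deleted = FALSE
GROUP BY e.user_id;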

Related

Is it good practice to store a calculated value?

I'm working on a billing system, and calculating the total amount of an invoice on the database requires quite a bit of SQL. For instance:
some items are taxable, some aren't;
some items have discounts;
some are calculated dividing the price by an interval (e.g. € 30,00/month);
invoices may have overdue fees;
invoices have different tax rates.
Since performing queries is becoming more and more complex with every feature I add, I'm thinking about storing some calculations (net and gross amounts for invoice items and for the invoice). I've seen some invoicing frameworks doing it, so I thought it's not a bad practice per se.
However, I'm a bit worried about data integrity.
Cache invalidation in my application shouldn't be too hard: whenever an invoice gets changed somehow, I re-run the calculations and save the new values. But what if, someday, someone runs a script or some SQL code directly on the database?
I know there are some questions about the topic, but I'd like to discuss it further.
Yes, caching is fine as long as you don't forget to invalidate it when necessary (which is one of the two hardest problems in CS).
But what if, someday, someone runs a script or some SQL code directly on the database?
Well, with great power comes great responsibility. If you don't want the responsibility, take the power away (forbid direct SQL access). There's really nothing else you can do here.
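One way to keep that honest is to treat the line items as the source of truth and the stored totals as a cache that can be recomputed and cross-checked at any time; a rough sketch under those assumptions (all names are illustrative):

CREATE TABLE invoice_items (
    item_id    INTEGER PRIMARY KEY,
    invoice_id INTEGER NOT NULL,
    net_amount NUMERIC(12, 2) NOT NULL,
    tax_rate   NUMERIC(5, 4) NOT NULL DEFAULT 0   -- 0 for non-taxable items
);

CREATE TABLE invoices (
    invoice_id   INTEGER PRIMARY KEY,
    total_cached NUMERIC(12, 2)                    -- denormalized, refreshed by the application
);

-- Cross-check: list invoices whose cached total has drifted from the items,
-- e.g. after someone ran ad-hoc SQL directly against the database.
SELECT i.invoice_id, i.total_cached,
       SUM(it.net_amount * (1 + it.tax_rate)) AS total_recomputed
FROM invoices i
JOIN invoice_items it ON it.invoice_id = i.invoice_id
GROUP BY i.invoice_id, i.total_cached
HAVING i.total_cached <> SUM(it.net_amount * (1 + it.tax_rate));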

In what circumstances is support for transactions not that important?

After posting the question here, I got to know that NoSQL databases are better at scaling out because they trade transaction support for scalability.
So I wonder: in what circumstances are transactions not that important, so that scalability is preferable to transaction support?
Well, I would say first that NoSQL is better at scaling in some situations, but not all.
Full ACID transactions are Atomic, Consistent, Isolated and Durable. If you lose transactions, you will lose some or all of ACID within the datastore.
There are many ways to restore these functions with other asynchronous systems like message queues that themselves are durable. You can shove data onto a durable message queue, pop the data and deal with it in your NoSQL, then, when you can confirm it's stored to your required minimum, you can flag the message as consumed. It's the D in ACID, but distributed and asynchronous. There are ways to ensure the others, but they are often sacrificed to some extent, or moved into another place in the system. With some NoSQL solutions, you just have to move consistency into the application so it doesn't try to store invalid data.
When you start moving away from database driven transactions, you must increase your application testing dramatically to ensure your system doesn't fail (for some values of fail).
There are essentially no situations where transactions and constraints are not important in a system that has both read and write requirements. If they weren't, you wouldn't care about your data at all (and some people don't, but regret it later). There are, however, levels of "caring". It's just a matter of how you end up at ACID or some pseudo-ACID that's "good enough". RDBMSes make caring about your data cheap. NoSQL makes caring about your data expensive, but it makes scaling cheap(er) (in some cases). There are many companies with multi-terabyte databases in RDBMSes, so to say unilaterally that "they don't scale" is simply inaccurate. Multi-terabyte SQL databases, however, can cost lots of money, depending on the use case (you can, after all, just slap a RAID 10 array with a few 3TB drives onto a computer and throw a database engine on it. It might take several minutes to a few hours to do any kind of table scan on a big table, or even an indexed look-up, but if you don't care, it's cheap and multi-terabyte).
The biggest category is read-only type queries, where an aborted or botched transaction can simply be repeated. Anything where you are changing an underlying state, or want to guarantee once and only once activity, should have proper transactional semantics.
That is, "I want to order one widget, charge my credit card" should be a proper transaction: I don't want my card charged unless the widget is ordered, and the vendor doesn't want the widget sent unless the card is charged. "Report the shipment status of order xyz" doesn't need to be transactional -- if I don't get an answer, I can hit reload.
Much of it is just a bit of lateral thinking.
The whole point of a transaction is that you wrap up several operations, and should any fail, all that have succeeded get rolled back; while the transaction is in progress, records are locked, and unless you have read uncommitted going, you don't see any of the individual changes of state until the transaction is committed.
Doing all that with distributed systems is expensive, because you need one 'central' and difficult to scale point that needs to 'know' all about the others.
So instead of "Order this, charge my card, and show me my current balance",
you do "Try to order this; if it's in stock, charge my card; and if my card gets charged, the current known balance will be this."
There's a risk that the order will be placed but payment fails, so you need to deal with that. There's a risk that the proposed balance of the card may not be entirely accurate, hence add weasel words and show the potential effect of the payment as opposed to the result.
It's not so much whether transactions are important; it's that, since they aren't as well supported in NoSQL systems, the question becomes where and how you can get away with not using them.

2 Questions about Philosophy and Best Practices in Database Development

Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, paired with an application that "recalculates" all the secondary information, on demand, based on that basic information, OR a database filled with all that secondary information already calculated in advance, but possibly outdated?
Obviously, there is a trade-off there and I think that anyone would say that the best answer to this question is "depends" or "a mix between the two". But I'm really not too comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time, or should a database accumulate all the information from previous times, allowing you to retrace what happened? For instance, let's say that I'm modeling a bank account. Should I only keep the account's balance on a given day, or should I keep all of the account's transactions and infer the balance from those transactions?
Any pointers on this kind of stuff that goes somewhat deeper into database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDBMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data on demand rather than store it, the total processing cost grows with both the size of the data and the number of users requesting it (more data generally also means more users). Roughly:
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when the data volume is high enough that processing time becomes significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, in the database. This would simply mean that they will not need calculating again until they become out of date.
There is no reason to ever have out-of-date pre-calculated values. That's what triggers are for (among other things). However, for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always adequate. Now, in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalculation process based on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, if you have an order, do not rely on the customer address table or the product table to get data about where the products were shipped to or what they cost at the time of the order. This data changes over time and then your orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR applications are usually an exception).
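A minimal sketch of that trigger-based precalculation for the bank-account example (PostgreSQL syntax; all names are illustrative):

-- The transactions table is the source of truth; the trigger keeps the
-- precalculated balance on accounts in step with it.
CREATE TABLE accounts (
    account_id INTEGER PRIMARY KEY,
    balance    NUMERIC(14, 2) NOT NULL DEFAULT 0
);

CREATE TABLE account_transactions (
    txn_id     INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts (account_id),
    amount     NUMERIC(14, 2) NOT NULL,           -- positive = deposit, negative = withdrawal
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE FUNCTION apply_transaction() RETURNS trigger AS $$
BEGIN
    UPDATE accounts
       SET balance = balance + NEW.amount
     WHERE account_id = NEW.account_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transaction_applies_balance
AFTER INSERT ON account_transactions
FOR EACH ROW EXECUTE FUNCTION apply_transaction();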
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
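For example, many engines let you persist simple derived values as stored generated columns, which keeps them consistent and indexable without any application code (PostgreSQL 12+/MySQL-style syntax; names are illustrative):

-- gross_amount is derived from other columns in the same row; the database
-- recomputes it on every write, so it can never drift out of date.
CREATE TABLE invoice_lines (
    line_id      INTEGER PRIMARY KEY,
    net_amount   NUMERIC(12, 2) NOT NULL,
    tax_rate     NUMERIC(5, 4) NOT NULL DEFAULT 0,
    gross_amount NUMERIC(12, 2) GENERATED ALWAYS AS (net_amount * (1 + tax_rate)) STORED
);

CREATE INDEX idx_invoice_lines_gross ON invoice_lines (gross_amount);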
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometimes data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming (not calculating anything twice). If you're not limited on space and are fine with slightly outdated data, then precalculate it and store it in the DB. This will give you the additional benefit of being able to run sanity checks and ensure that the data is always consistent.
But as others already replied, it depends :)

Database design: columns with duplicate meaning

Maybe the title of the question doesn't describe my problem very well, but here it is:
Let's say I have a table article that looks like this:
+ title
+ author
.
.
.
+ status [choices : ('draft', ..., 'translate')]
And let's say that, in my business process, I publish on my web page the articles that have [status = 'translate'].
Is it a good design decision to add another field:
+ read [bool]
to my table, meaning that the article is ready to be published? Or is it bad design, because I can test status == 'translate' for that, and the new field would just be a duplicate?
I hope that my question is clear, and thanks in advance.
Here's a fundamental DB design concept (it's actually part of making your table comply with 3NF): none of your columns should depend on anything but the primary key of the table. That should answer your question.
Here's a good quote to remember that: every non-key attribute "must provide a fact about the key, the whole key, and nothing but the key, so help me Codd".
(That's also from the wiki.)
The reason for that is that breaking this law might compromise data integrity.
Bad design.
First, what you do have here is a field that is basically the current state of a state engine.
Second, status should be a separate table - do NOT put status texts in the same table. You can then add additional info for every possible status to the status table.
Duplicate. If you can manage without a column, don't use it.
Think about the overhead it adds to your database (and besides, an index on a boolean column has such low selectivity that it won't do much for your performance).
(And of course, replace the status strings with numeric values).
Good luck.
In order for there not to be a potential data integrity conflict, you could make the "ready" column a computed column, or you could make a view which provides this translation service.
However, for this particular design, I would put the states into a table and have an IsReady column in the state table. Then you could add different states which are all IsReady. I have used designs like this many times, where certain states are equivalent for some operations, but not for others. Each has a flag. In my particular case, many batches in different states were allowed to be counted as "successful" for average timing/performance purposes, but batches which had completed successfully but were later invalidated were not considered "successful", etc.
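A rough sketch of that design, with the "ready" fact recorded once on the state row rather than duplicated onto every article (names are illustrative):

CREATE TABLE article_status (
    status_id INTEGER PRIMARY KEY,
    name      VARCHAR(20) NOT NULL,               -- 'draft', ..., 'translate'
    is_ready  BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    title      VARCHAR(200) NOT NULL,
    author     VARCHAR(100) NOT NULL,
    status_id  INTEGER NOT NULL REFERENCES article_status (status_id)
);

-- A view plays the role of the duplicate "read" column with no risk of the
-- two ever contradicting each other.
CREATE VIEW publishable_articles AS
SELECT a.*
FROM article a
JOIN article_status s ON s.status_id = a.status_id
WHERE s.is_ready;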
This case has a name in normalization theory. It's called "harmful redundancy". Here are some potential drawbacks to harmful redundancy: the database can contradict itself; too much space is wasted; too much time is wasted.
Contradictions in the database get the most air time in tutorials on database design. You have to take some measures to prevent this situation from arising, or live with the consequences. You can rely on careful programming to keep contradictions out of the database, or you can declare constraints that will prevent any transaction from leaving the database in a contradictory state.
The waste of space is usually, but not always, a trivial cost. Wasted space can result in wasted time as a consequence.
The waste of time is the one that preoccupies programmers the most. But here, the issue gets to be more subtle. Sometimes "harmful redundancy" results in saving time, not wasting it. Most often, it results in extra time during updates, but time saving during retrieval. Often, the time savings or wastage is trivial, and therefore so is the design decision, from the point of view of speed.
In your case, the speed consequences should be minimal. Here's a hint: how often do you update rows? How often do you read them? How important is speed in updating or reading? If the time you gain during reading has more weight for you than the time you spend updating, then go for it.

BALD-D battle Against Bad Database Design

I'm no DBA, but I respect database theory. Isn't adding columns like isDeleted and sequenceOrder bad database practice?
That depends. Being able to soft-delete a tuple (i.e., mark it as deleted rather than actually deleting it) is essential if there's any need to later access that tuple (e.g., to count deleted things, or do some type of historical analysis). It also has the possible benefit, depending on how indexes are structured, of causing less of a disk-traffic hit when soft-deleting a row (by having to touch fewer indexes). The downside is that the application takes on responsibility for managing foreign keys to soft-deleted things.
If soft deleting is done for performance, a periodic (e.g., nightly or weekly) task can clean soft-deleted tuples out during a low-traffic period.
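A sketch of such a clean-up, assuming the soft delete is an is_deleted flag plus a deleted_at timestamp (illustrative names; interval syntax varies by engine):

-- Run during a low-traffic window: permanently remove rows that were
-- soft-deleted more than 30 days ago.
DELETE FROM posts
WHERE is_deleted = TRUE
  AND deleted_at < CURRENT_TIMESTAMP - INTERVAL '30' DAY;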
Using an explicit 'sequence order' for some tuples is useful in several cases, esp. when it's not possible or wise to depend on some other field (e.g., ids, which app developers are trained not to trust) to order things that need to be ordered in some specific way for business reasons.
IsDeleted columns have two purposes.
To hide a record from users instead of deleting it, thus retaining the record in the database for later use.
To provide a two-stage delete process, where one user marks a record for deletion, and another user confirms.
Not sure what SequenceOrder is about. Do you have a specific application in mind?
Absolutely not. Each database has different requirements, and based on those requirements, you may need columns such as those.
An example for isDeleted could be if you want to allow the user interface to delete unneeded things, but retain them in the database for auditing or reporting purposes. Or if you have incredibly large datasets, deleting is a very slow operation and may not be possible to perform in real-time. In this case, you can mark it deleted, and run a batch clean-up periodically.
An example for sequenceOrder is to enable arbitrary sorting of database rows in the UI, without relying on intrinsic database order or sequential insertion. If you insert rows in order, you can usually get them back out in that order... until people start deleting and inserting new rows.
SequenceOrder doesn't sound great (although you've given no background at all), but I've used columns like IsDeleted for soft deletions all my career.
Since you explicitly state that you're interested in the theoretical perspective, here goes:
At the level of the LOGICAL design, it is almost by necessity a bad idea to have a boolean attribute in a table (btw, the theory's correct term for this is "relvar", not "table"). The reason is that having a boolean attribute makes it very awkward to define/document the meaning (relational theory names this the "predicate") that the relvar has in your system. If you include the boolean attribute, then the predicate defining such a relvar's meaning would have to include some construct like "... and it is -BOOLEANATTRIBUTENAME here- that this tuple has been deleted." That is an awkward circumlocution.
At the logical design level, you should have two distinct tables, one for the non-deleted rows, and one for the deleted-rows-that-someone-might-still-be-interested-in.
At the PHYSICAL design level, things may be different. If you have a lot of delete-and-undelete activity, or even just a lot of delete activity, then physically having two distinct tables is likely to be a bad idea. One table with a boolean attribute that acts as a "distinguishing key" between the two logical tables might indeed be better. If, on the other hand, you have a lot of query activity that only needs the non-deleted ones, and the volume of deleted ones is usually large in comparison to the non-deleted ones, it might be better to keep them apart physically too (and bite the bullet about the probably worse update performance you'll get, if that were noticeable).
But you said you were interested in the theoretical perspective, and theory (well, as far as I know it) has actually very little to say about matters of physical design.
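One common way to reconcile the two levels is a single physical table with the boolean acting as the distinguishing key, exposed as the two logical relvars through views (a sketch; names are illustrative):

-- One physical table...
CREATE TABLE all_articles (
    article_id INTEGER PRIMARY KEY,
    title      VARCHAR(200) NOT NULL,
    is_deleted BOOLEAN NOT NULL DEFAULT FALSE
);

-- ...presented as the two relvars the logical design calls for.
CREATE VIEW live_articles AS
SELECT article_id, title FROM all_articles WHERE NOT is_deleted;

CREATE VIEW deleted_articles AS
SELECT article_id, title FROM all_articles WHERE is_deleted;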
Wrt the sequenceOrder column, that really depends on the particular situation. I guess that most of the time you wouldn't need it, because ordering of items as required by the business is most likely to be on "meaningful" data. But I could imagine sequenceOrder columns being used to mimic insertion timestamps and the like.
Backing up what others have said, both can have their place.
In our CRM system I have an isDeleted-like field in our customer table so that we can hide customers we are no longer servicing while leaving all the information about them in the database. We can easily restore deleted customers and we can strictly enforce referential integrity. Otherwise, what happens when you delete a customer but do not want to delete all records of the work you have done for them? Do you leave references to the customer dangling?
SequenceOrder, again, is useful to allow user-defined ordering. I don't think I use it anywhere, but suppose you had to list say your five favorite foods in order. Five tasks to complete in the order they need to be completed. Etc.
Others have adequately tackled isDeleted.
Regarding sequenceOrder, business rules frequently require lists to be in an order that may not be determined by the actual data.
Consider a table of priority statuses. You might have rows for High, Low, and Medium. Ordering by the description will give you either High, Low, Medium or Medium, Low, High.
Obviously that order does not give information about the relationship that exists between the three records. Instead you would need a sequenceOrder field so that it makes sense. So that you end up with [1] High, [2] Medium, [3] Low; or the reverse.
Not only does this help with human readability, but system processes can now give appropriate weight to each one.
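A minimal sketch of that pattern (names are illustrative):

CREATE TABLE priority (
    priority_id    INTEGER PRIMARY KEY,
    description    VARCHAR(20) NOT NULL,   -- 'High', 'Medium', 'Low'
    sequence_order INTEGER NOT NULL        -- 1, 2, 3: the business-defined order
);

-- Lists come out in the order the business expects, not alphabetical order.
SELECT description
FROM priority
ORDER BY sequence_order;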
