Like most people, I work on a data-driven object-oriented business application. I use a relational database to store my data.
I have designed my application so far that I never store computed values. That is, if a user would like to consult the output of a "simulation" he ran last year, my application would simply recompute the report from existing historical data, instead of reading the result of the simulation. Since the report takes very little time to create - it's mostly simple arithmetic - can I safely assume I can get by without storing the result of their reports? I'm having a hard time figuring out a future business requirement which would make me regret not having stored the results in the first place.
Not storing the result of computations reduces redundancy and is generally, in my opinion, a good thing - but it comes down to a case of normalization vs. computational power required for an operation. Often, databases don't get 100% normalized because we don't live in the ideal world where that would be ideal (infinite CPU power and I/O speed), and so it should really be determined on a case-by-case basis.
If you can't foresee a need for storing the results of a computation in the DB, I'd suggest you don't store it. A DB is generally easier to maintain, the more normalized it is.
Tax changes, any other regulation that apply to All the data.
You can get around it by using values at the time, but when it involves any calculation that's changed complexity starts to build up.
If you have all the data you and can compute simulations based on that data then you should be fine. If in the future you find that these simulations are taking too long to run then you can begin storing the computed values and simply change your application to pull historical values from there.
Related
I am currently working on a project that requires us to store a large amount of time series data, but more importantly, retrieve large amounts of it quick.
There will be N devices (>10,000) which will periodically send data to the system, lets say every 5 seconds. This data will quickly build up, but we are generally only interested in the most recent data, and want to compact the older data. We don't want to remove it, as it is still useful, but instead of having thousands of data point for a day, we might save just 5 or 10 after N days/weeks/months have passed.
Specifically we want to be able to fetch sampled data over a large time period, say a year or two. There might be millions of points here, but we just want a small, linearly distributed, sample of this data.
Today we are experimenting with influxdb, which initially seemed like an alright solution. It was fast enough and allows us to store our data in a reasonable structure, but we have found that it is not completely satisfactory. We were unable to perform the sample query described above and in general the system does not feel mature enough for us.
Any advice on how we can proceed, or alternative solutions, is much appreciated.
You might be interested in looking at TimescaleDB:
https://github.com/timescale/timescaledb
It builds a time-series DB on top of Postgres and so offers full SQL support, as well as generally the Postgres ecosystem/reliability. This can give you a lot greater query flexibility, which sounds like you want.
In terms of your specific use case, there would really be two solutions.
First, what people typically would do is to create two "hypertables", one for raw data, another for sampled data. These hypertables look like standard tables to the user, although heavily partitioned under the covers for much better scalability (e.g., 20x insert throughput vs. postgres for large table sizes).
Then you basically do a roll-up from the raw to the sampled table, and use a different data retention policy on each (so you keep raw data for say 1 month, with sampled data for years).
http://docs.timescale.com/getting-started/setup/starting-from-scratch
http://docs.timescale.com/api/data-retention
Second, you can go with a single hypertable, and then just schedule a normal SQL query to delete individual rows from data that's older than a certain time period.
We might even in the future add better first-class support for this latter approach if it becomes a common-enough requested feature, although most use cases we've encountered to date seemed more focused on #1, esp. in order to to keep statistical data about removed data-points, as opposed to just straight samples.
(Disclaimer: I'm one of the authors of TimescaleDB.)
Here's what I can discern:
When values are going to be queried for quite often, it would be best to store them and update the stored values every time they change. The reason being that you'd increase performance by not having to recompute each time. I'm not sure when caching comes into play with database queries as this would be database dependent, hence this factor could drastically change my position
Less often queried values would make sense to be computed upon a query I'm guessing.
There's two extra factors that I can think of -- complimentary to both storing and computing. First would be the extra space associated with storing values and whether or not it would be trivial. If not, computing may be more viable. Second -- in a very similar way -- is the time cost associated with computing values and whether or not that would be trivial also -- which would make the query frequency more of a concern.
Are there any other reasons to choose one or the other between storing and computing database values?
If it's an operational / transactional database, like an ERP or e-commerce store, then you fully normalize, and don't store computed values.
If it's a data warehouse, then you can store computed values.
Premature optimization is premature. Write your application properly, measure performance, find bottlenecks, and then optimize.
In all the applications I have made where a database is used I typically store the calculated value along with the variables needed to calculate that value. For example, if I have tonnage and cost I would multiply them to calculate the total. I could just recalculate the value every time it is needed, I was just wondering if there was an standard approach. Either way is fine with me, I just want to do what is most common.
If I store the calculate variables it makes my domain classes a bit more complex, but makes my controller logic cleaner, if I don't store the calculated variables it is the other way around.
The calculations would not be extremely frequent, but may be moderately frequent, but math is cheap right?
The standard approach is not to store this kind of calculated values - it breaks normalization.
There are cases you want to store calculated values, if it takes too long to recalculate, or you are running a data warehouse etc. In your case, you want stick to the normalization rules.
This violates Normal Form to have this calculated value. Unless there is a reason to denormalize (usually performance constraints) then you should make every attempt to normalize your tables, it will make your database much easier to maintain/improve and denormalize may lock you into a design that is difficult to alter easily and exposes your data to inconsistencies and redundancy.
In my experience, the most common thing to do is to a) store the calculated value, b) without any CHECK constraints in the database that would guarantee that the value is correct.
The right thing to do is either
don't store the result of the calculation
store the calculated value in a column that's validated with a CHECK constraint.
MySQL doesn't support CHECK constraints. So your options are
don't store the result of the calculation
switch to a dbms that supports CHECK constraints, such as PostgreSQL.
It all depends on what resources are scarce in your environment. If you do pre-calculate the value, you'll save CPU time at the cost of increased network usage and DB storage space. These days, CPU time is generally much more abundant than network bandwidth and DB storage, so I'm going to guess that as long as the calculation isn't too complicated then pre-calculating the value is not worth it.
On the other hand, perhaps the value you're calculating takes a substantial amount of CPU. In this case, you may want to cache that value in the DB.
So, it depends on what you have and what you lack.
Simple math is relatively cheap, however you need to weigh up the additional storage cost vs performance saving when storing these values. Another thing you may want to consider is the affect this will have on data updates, where you cant simply just update the field value, you need to update the calculated value too.
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If you measure there is a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
Too much normalization will hurt performance so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) using DB2 I used a MQT (Materialized Query Table) that works like a view only it's driven by a query and you can schedule how often you want it to refresh; e.g. every 5 min. Then that table stored the count values.
2) in the software package itself I set information like that as a system variable. So in Apache you can set a system wide variable and refresh it every 5 minutes. Then it's somewhat accurate but your only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it so it's been while but I think in PHP was was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value it saves you from running it with every request.
Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, sided with a application that "recalculates" all the secondary information, on demand, based on those basic ones, OR, a database filled with all those secondary information already previously calculated, but possibly outdated?
Obviously, there is a trade-of there and I think that anyone would say that the best answer to this question is: "depends" or "is a mix between the two". But I'm really not to comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time or should a database accumulate all the information from previous time, allowing the retrace of what happened? For instance, let's say that I'm modeling a Bank Account. Should I only keep the one's balance on that day, or should I keep all the one's transactions, and from those transactions infer the balance?
Any pointer on this kind of stuff that is, somehow, more deep in database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing time doesn't increase at the same rate as the size of data does - the more data ~ the more users. This would generally mean that processing times would multiply by the data's size and the number of users.
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out of date pre-calulated values. That's what trigger are for (among other things). However for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always there. Now in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalulation process bases on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, you have an order, do not rely onthe customer address table or the product table to get data about where the prodcts were shipped to or what they cost at the time of the order. This data changes over time and then you orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR application are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming(not calculate anything twise). If you're not limited with space and are fine with a bit outdated data, then precalculate it and store in the DB. This will give you additional benefit of being able to run sanity checks and ensure that data is always consistent.
But as others already replied, it depends :)