For background: I was recently hired as a database engineer for a water treatment company. We deploy water treatment machines to sites across the country, and the machines treat water and send continuous data back to us regarding the state of incoming water (flow rate, temperature, concentration of X in incoming water, etc.), and regarding the treatments the machine applied to that water at that point in time. Over time, sites (and their various components) change a lot: a machine might break down and need to be replaced, a different concentration of chemical may be used to fill the machine's tanks, its flow meters and other sensors might be recalibrated or set to scale differently, its chemical pumps might be replaced, and on and on. These affect the interpretation of the data: for example, if 5 mL of chlorine was added to the incoming water at 01/01/2021 12:00:05, that means two completely different things if the chlorine was 5% concentrated or 40% concentrated.
Water treatment datapoints are identified by a composite key consisting of the ID of the site, and a timestamp. It would be easy if the only data that mattered was current data, as I could just store the configuration settings on a Site level and pull them up for datapoints as needed. But we need to be able to correctly interpret older data. So, I thought about storing configurations in another table, tracking all the settings for each site over each time period, but it's not possible to create a foreign key between the continuous timestamps of the datapoints and the start/end dates of the configurations - the closest thing would be some kind of range check, like "Datapoint.TimeStamp BETWEEN Configuration.Start AND Configuration.End". So the only other option I see is to store every configuration setting for every datapoint alongside each datapoint, but that seems like a terrible solution given how many configuration settings there are and how many datapoints are generated, especially since most of the settings don't even change often.
So, is there a way to store historical configurations for each row of data in a way that is at all normalized, or is the only possible solution to cram all the settings into each datapoint?
If I understood your request:
1 - a water datapoint is identified by a composite key consisting of the ID of the site, and a timestamp:
SiteID
TimeStampID
2 - a water datapoint can have multiple configurations, for example when a breakdown happens:
ConfigurationID
StartDate
EndDate
Let's consider a DataPoint having the following information for a specific day:
DataPoint SiteID TimeStampID
1001 101 01-02-2021 09:00:01
1001 101 01-02-2021 10:20:31
1001 101 01-02-2021 17:45:00
On that day, a breakdown started at 11:01:20 and ended at 11:34:22.
ConfigurationID DataPoint StartDate EndDate
155 1001 01-02-2021 11:01:20 01-02-2021 11:34:22
The original answer that I accepted seems to have been deleted. For anyone coming here in the future, the solution that I intend to go with is as follows:
I'm going to create a configuration table to hold settings in the following format:
_SiteID_ _Start_ _End_ <various settings fields>
318 "2021-01-01 12:22:03" "2021-02-10 09:08:26" ...
Where the primary key is (SiteID, Start, End). SiteID is a foreign key to the integer ID of the Site table, Start is the date at which the configuration starts being valid, and End (default: NULL) is the date at which the configuration is no longer valid. In order to keep things good and simple for users (and myself), and to prevent any accidental updates to old configuration settings when instead there should have been a new configuration row inserted, I'm going to disallow UPDATE and DELETE operations on the configuration table for all users except root, and instead create a stored procedure for "updating" the configuration of a given Site. The stored procedure will take whatever new parameters the user specified, copy in any parameters that the user DIDN'T specify from the most recent configuration for that Site (i.e., the row with the same SiteID and the NULL End date), overwrite the most recent configuration row's NULL End date to be the Start date for the new row, and finally create the new row with the specified Start date.
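As a rough illustration, here is a T-SQL-flavoured sketch of that stored procedure (adapt the syntax to your platform). The table name SiteConfiguration and the two settings columns ChlorineConcentration and FlowMeterScale are illustrative stand-ins for the real settings fields; a NULL parameter means "not specified, carry the old value forward".
CREATE PROCEDURE UpdateSiteConfiguration
    @SiteID                INT,
    @Start                 DATETIME,
    @ChlorineConcentration DECIMAL(5,2) = NULL,   -- NULL = not specified, carry forward
    @FlowMeterScale        DECIMAL(9,4) = NULL
AS
BEGIN
    BEGIN TRANSACTION;

    -- Copy forward any settings the caller did not specify, from the
    -- currently open configuration for this Site (the row with End IS NULL).
    SELECT  @ChlorineConcentration = COALESCE( @ChlorineConcentration, ChlorineConcentration ),
            @FlowMeterScale        = COALESCE( @FlowMeterScale,        FlowMeterScale )
    FROM    SiteConfiguration
    WHERE   SiteID = @SiteID
    AND     [End] IS NULL;

    -- Close the currently open configuration at the new Start date.
    UPDATE  SiteConfiguration
    SET     [End]  = @Start
    WHERE   SiteID = @SiteID
    AND     [End] IS NULL;

    -- Create the new open configuration row.
    INSERT INTO SiteConfiguration ( SiteID, Start, [End], ChlorineConcentration, FlowMeterScale )
    VALUES ( @SiteID, @Start, NULL, @ChlorineConcentration, @FlowMeterScale );

    COMMIT TRANSACTION;
END;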
NOTE: the Start date and End date are both stored for each configuration because configurations might not necessarily be continuous, i.e. it is not the case that "as soon as a configuration expired, there is another configuration that starts at the exact time that that configuration expired", as deployments of water treatment equipment sometimes have large gaps in between them if a client doesn't need our services for some period of time. Without storing the End dates for configurations too, we would have to assume that each configuration lasts until the next configuration begins, or until now, if there is no later configuration stored. So End date is stored so that we don't ever think "Site A was configured to have X Y Z settings from January 2020 to June 2021" when there hasn't even been a machine at Site A since May of 2020. Storing the End date explicitly alongside the Start date also avoids the ickiness of needing to rely on the values in other rows of configuration data to know how to interpret a given row of configuration data.
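Looking up the settings that were in effect for any given datapoint is then the range check described in the question; a sketch (column names illustrative):
SELECT  d.SiteID,
        d.TimeStamp,
        d.ChlorineVolume,            -- illustrative reading from the datapoint
        c.ChlorineConcentration      -- setting in effect at that moment
FROM    DataPoint d
JOIN    SiteConfiguration c
        ON  c.SiteID    = d.SiteID
        AND d.TimeStamp >= c.Start
        AND ( d.TimeStamp < c.[End] OR c.[End] IS NULL );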
Thank you to whoever it was who originally gave me the inspiration for this answer, I have no idea why your answer was deleted.
This isn't exactly a programming question, as I don't have an issue writing the code, but a database design question. I need to create an app that tracks sales goals vs. actual sales over time. The thing is that a person's goal can change (let's say daily at most).
Also, a location can have multiple agents with different goals that need to be added together for the location.
I've considered basically running a timed task to save daily goals per agent into a field. It seems that over the years that will be a lot of data, but it would let me simply query a date range and add all the daily goals up to get a goal for that date range.
Otherwise, I guess I could simply record changes (e.g. March 2nd - 15 sales/week, April 12th - 16 sales/week), which would be less data, but much more programming work to figure out goals based on a time query.
I'm assuming there is probably a best practice for this - anyone?
Put a date range on your goals. The start of the range is when you set that goal. The end of the range starts off as max-collating date (often 9999-12-31, depending on your database).
Treat this as "until forever" or "until further notice".
When you want to know what goals were in effect as of a particular date, you would have something like this in your WHERE clause:
...
WHERE effective_date <= #AsOfDate
AND expiry_date > #AsOfDate
...
When you change a goal, you need two operations: first, you update the existing record (if it exists) and set the expiry_date to the new as-of date. Then you insert a new record with an effective_date of the new as-of date and an expiry_date of forever (e.g. '9999-12-31').
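A minimal sketch of that change, assuming a goal table shaped like the one described (table, column and parameter names are illustrative):
BEGIN TRANSACTION;

-- 1. Expire the goal currently in effect for this agent.
UPDATE agent_goal
SET    expiry_date = @AsOfDate
WHERE  agent_id    = @AgentID
AND    expiry_date = '9999-12-31';

-- 2. Insert the new goal, effective from the as-of date "until forever".
INSERT INTO agent_goal ( agent_id, effective_date, expiry_date, weekly_goal )
VALUES ( @AgentID, @AsOfDate, '9999-12-31', @WeeklyGoal );

COMMIT TRANSACTION;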
This gives you the following benefits:
Minimum number of rows
No scheduled processes to take daily snapshots
Easy retrieval of effective records as of a point in time
Ready-made audit log of changes
Like normal bank accounts, we have a lot of transactions which result in inflows or outflows of money. The account balance can always be derived by simply summing up the transaction values. What would be better, storing the updated account balance in the database or re-calculating it whenever needed?
Expected transaction volume per account: <5 daily.
Expected retrieval of account balance: Whenever a transaction happens and once a day on an average otherwise.
Preface
There is an objective truth: Audit requirements. Additionally, when dealing with public funds, there is legislation that must be complied with.
You don't have to implement the full accounting requirement, you can implement just the parts that you need.
Conversely, it would be ill-advised to implement something other than the standard accounting requirement (the parts thereof) because that guarantees that when the number of bugs or the load exceeds some threshold, or the system expands, you will have to re-implement. A cost that can, and therefore should, be avoided.
It also needs to be stated: do not hire an unqualified, un-accredited "auditor". There will be consequences, the same as if you hired an unqualified developer. It might be worse, if the Tax Office fines you.
Method
The Standard Accounting method in not-so-primitive countries is this. The "best practice", if you will, in others.
This method applies to any system that has similar operations; needs; historic monthly figures vs current-month requirements, such as Inventory Control, etc.
Consideration
First, the considerations.
Never duplicate data.
If the Current Balance can be derived (and here it is simple, as you note), do not duplicate it with a summary column.
Such a column is a duplication of data. It breaks Normalisation rules.
Further, it creates an Update Anomaly, which otherwise does not exist.
If you do use a summary column, when a new AccountTransaction is inserted, the summary column Current Balance value is rendered obsolete, therefore it must be updated all the time anyway. That is the consequence of the Update Anomaly. Which eliminates the value of having it.
External publication.
Separate point. If the balance is published, as in a monthly Bank Statement, such documents usually have legal restrictions and implications, thus that published Current Balance value must not change after publication.
Any change, after the publication date, in the database, of a figure that is published externally, is evidence of dishonest conduct, fraud, etc.
Such an act, attempting to change published history, is the hallmark of a novice. Novices and mental patients will insist that history can be changed. But as everyone should know, ignorance of the law does not constitute a valid defence.
You wouldn't want your bank, in Apr 2015, to change the Current Balance that they published in their Bank Statement to you of Dec 2014.
That figure has to be viewed as an Audit figure, published and unchangeable.
To correct an erroneous AccountTransaction that was made in the past, that is being corrected in the present, the correction or adjustment that is necessary, is made as a new AccountTransaction in the current month (even though it applies to some previous month or duration).
This is because that applicable-to month is closed; Audited; and published, because one cannot change history after it has happened and it has been recorded. The only effective month is the current one.
For interest-bearing systems, etc, in not-so-primitive countries, when an error is found, and it has an historic effect (eg. you find out in Apr 2015 that the interest calculated monthly on a security has been incorrect, since Dec 2014), the value of the corrected interest payment/deduction is calculated today, for the number of days that were in error, and the sum is inserted as an AccountTransaction in the current month. Again, the only effective month is the current one.
And of course, the interest rate for the security has to be corrected as well, so that that error does not repeat.
The same principles apply to Inventory control systems. It maintains sanity.
All real accounting systems (ie. those that are accredited by the Audit Authority in the applicable country, as opposed to the mickey mouse "packages" that abound) use a Double Entry Accounting system for all AccountTransactions, precisely because it prevents a raft of errors, the most important of which is, funds do not get "lost". That requires a General Ledger and Double-Entry Accounting.
You have not asked for that, you do not need that, therefore I am not describing it here. But do remember it, in case money goes "missing", because that is what you will have to implement, not some band-aid solution; not yet another unaccredited "package".
This Answer services the Question that is asked, which is not Double-Entry Accounting.
For a full treatment of that subject (detailed data model; examples of accounting Transactions; rows affected; and SQL code examples), refer to this Q&A:
Relational Data Model for Double-Entry Accounting.
The major issues that affect performance are outside the scope of this question, but to furnish a short and determinant answer: it is dependent on:
Whether you implement a genuine Relational Database or not (eg. a 1960's Record Filing System, which is characterised by Record IDs, deployed in an SQL container for convenience).
Whether you use a genuine SQL Platform (architected; stable; reliable; SQL-compliant; OLTP; etc) or the pretend-SQL freeware (herd of programs; ever-changing; no compliance; scales like a fish).
The use of genuine Relational Keys, etc, will maintain high performance, regardless of the population of the tables.
Conversely, an RFS will perform badly, they simply cannot perform. "Scale" when used in the context of an RFS, is a fraudulent term: it hides the cause and seeks to address everything but the cause. Most important, such systems have none of the Relational Integrity; the Relational Power; or the Relational Speed, of a Relational DBMS.
Implementation
Relational Data Model • Bank Account
Relational Data Model • Inventory
Notation
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993.
My IDEF1X Introduction is essential reading for those who are new to the Relational Model, or its modelling method. Note that IDEF1X models are rich in detail and precision, showing all required details, whereas home-grown models have far less than that. Which means, the notation has to be understood.
Content
For each AccountNo, there will be one AccountStatement row per month, with a ClosingBalance; Statement Date (usually the first day of the next month) and other Statement details for the closed month.
This is not a "duplicate" or a derivable value that is stored because (a) the value applies to just one Date, (b) it is demanded for Audit and sanity purposes, and (c) provides a substantial performance benefit (elimination of the SUM( all transactions ) ).
For Inventory, for each PartCode, there will be one PartAudit row per month, with a QtyOnHand column.
It has an additional value, in that it constrains the scope of the Transaction rows required to be queried to the current month.
Again, if your table is Relational and you have an SQL Platform, the Primary Key for AccountTransaction will be (AccountNo, Transaction DateTime) which will retrieve the Transactions at millisecond speeds.
Whereas for a Record Filing System, the "primary key" will be AccountTransactionID, and you will be retrieving the current month by Transaction Date, which may or may not be indexed correctly, and the rows required will be spread across the file. In any case at far less than ClusteredIndex speeds, and due to the spread, it will incur a tablescan.
The AccountTransaction table remains simple (the real world notion of a bank account Transaction is simple). It has a single positive Amount column.
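A sketch of that table in generic SQL (column names are taken from the report code further below; data types are illustrative):
CREATE TABLE AccountTransaction (
    AccountNo           INT           NOT NULL,   -- FK to Account
    DateTime            DATETIME      NOT NULL,
    TransactionTypeCode CHAR(1)       NOT NULL,   -- FK to AccountTransactionType (see Corrective Advice)
    Amount              DECIMAL(12,2) NOT NULL CHECK ( Amount > 0 ),  -- always positive
    CONSTRAINT PK_AccountTransaction
        PRIMARY KEY ( AccountNo, DateTime )
    );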
For each Account, the CurrentBalance is:
the AccountStatement.ClosingBalance of the previous month, dated the first of the next month for convenience
for inventory, the PartAudit.QtyOnHand
plus the SUM( Transaction.Amounts ) in the current month, where the AccountTransactionType indicates a deposit
for inventory, the PartMovement.Quantity
minus the SUM( Transaction.Amount ) in the current month, where the AccountTransactionType indicates a withdrawal
(code provided below).
In this Method, the AccountTransactions in the current month, only, are in a state of flux, thus they must be retrieved. All previous months are published and closed, thus the Audit figure AccountStatement.ClosingBalance must be used.
The older rows in the AccountTransaction table can be purged. Older than ten years for public money, five years otherwise, one year for hobby club systems.
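For example, a five-year retention purge might be as simple as (T-SQL-flavoured; pick the cut-off that matches your legal requirement):
-- Purge Transaction rows that are past the retention period (five years shown).
DELETE FROM AccountTransaction
WHERE  DateTime < DATEADD( YY, -5, GETDATE() );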
Of course, it is essential that any code relating to accounting systems uses genuine OLTP Standards and genuine SQL ACID Transactions (not possible in the pretend-SQL freeware).
This design incorporates all scope-level performance considerations (if this is not obvious, please ask for expansion). Scaling inside the database is a non-issue; any scaling issues that remain are actually outside the database.
Corrective Advice
These items need to be stated only because incorrect advice has been provided in many SO Answers (and up-voted by the masses, democratically, of course), and the internet is chock-full of incorrect advice (amateurs love to publish their subjective "truths"):
Evidently, some people do not understand that I have given a Method in technical terms, to operate against a clear data model. As such, it is not pseudo-code for a specific application in a specific country. The Method is for capable developers; it is not detailed enough for those who need to be led by the hand.
They also do not understand that the cut-off period of a month is an example: if your cut-off for Tax Office purposes is quarterly, then by all means, use a quarterly cut-off; if the only legal requirement you have is annual, use annual.
Even if your cut-off is quarterly for external or compliance purposes, the company may well choose a monthly cut-off, for internal Audit and sanity purposes (ie. to keep the length of the period of the state of flux to a minimum).
Eg. in Australia, the Tax Office cut-off for businesses is quarterly, but larger companies cut-off their inventory control monthly (this saves having to chase errors over a long period).
Eg. banks have legal compliance requirements monthly, therefore they perform an internal Audit on the figures, and close the books, monthly.
In primitive countries and rogue states, banks keep their state-of-flux period at the maximum, for obvious nefarious purposes. Some of them only make their compliance reports annually. That is one reason why the banks in Australia do not fail.
In the AccountTransaction table, do not use negative/positive in the Amount column. Money always has a positive value, there is no such thing as negative twenty dollars (or that you owe me minus fifty dollars), and then working out that the double negatives mean something else.
The movement direction, or what you are going to do with the funds, is a separate and discrete fact (to the AccountTransaction.Amount). Which requires a separate column (two facts in one datum breaks Normalisation rules, with the consequence that it introduces complexity into the code).
Implement an AccountTransactionType reference table, the Primary Key values of which are ( D, W ) for Deposit/Withdrawal as your starting point. As the system grows, simply add ( A, a, F, w ) for Adjustment Credit; Adjustment Debit; Bank Fee; ATM_Withdrawal; etc.
No code changes required.
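A sketch of that reference table in generic SQL (the column name TransactionTypeCode is taken from the report code below):
CREATE TABLE AccountTransactionType (
    TransactionTypeCode CHAR(1)     NOT NULL,
    Name                VARCHAR(30) NOT NULL,
    CONSTRAINT PK_AccountTransactionType
        PRIMARY KEY ( TransactionTypeCode )
    );

INSERT INTO AccountTransactionType VALUES ( 'D', 'Deposit' );
INSERT INTO AccountTransactionType VALUES ( 'W', 'Withdrawal' );
-- As the system grows, simply add rows for Adjustment Credit, Bank Fee, etc; no code changes required.

ALTER TABLE AccountTransaction
    ADD CONSTRAINT FK_AccountTransaction_Type
    FOREIGN KEY ( TransactionTypeCode )
    REFERENCES  AccountTransactionType ( TransactionTypeCode );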
In some primitive countries, litigation requirements state that in any report that lists Transactions, a running total must be shown on every line. (Note, this is not an Audit requirement, because those are superior (refer Method above) to the court requirement; Auditors are somewhat less stupid than lawyers; etc.)
Obviously, I would not argue with a court requirement. The problem is that primitive coders translate that into: oh, oh, we must implement an AccountTransaction.CurrentBalance column. They fail to understand that:
the requirement to print a column on a report is not a dictate to store a value in the database
a running total of any kind is a derived value, and it is easily coded (post a question if it isn't easy for you). Just implement the required code in the report; a sketch is given after this list.
implementing the running total (eg. AccountTransaction.CurrentBalance) as a column causes horrendous problems:
introduces a duplicated column, because it is derivable. Breaks Normalisation. Introduces an Update Anomaly.
the Update Anomaly: whenever a Transaction is inserted historically, or an AccountTransaction.Amount is changed, all the AccountTransaction.CurrentBalances from that date to the present have to be re-computed and updated.
in the above case, the report that was filed for court use, is now obsolete (every report of online data is obsolete the moment it is printed). Ie. print; review; change the Transaction; re-print; re-review, until you are happy. It is meaningless in any case.
which is why, in less-primitive countries, the courts do not accept any old printed paper, they accept only published figures, eg. Bank Statements, which are already subject to Audit requirements (refer the Method above), and which cannot be recalled or changed and re-printed.
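To illustrate the point above: a sketch of a running total computed in the report code, against the data model above, using @-style parameters (the signing by TransactionTypeCode is simplified here to D = deposit, anything else a withdrawal; adjust to your codes):
SELECT  t.DateTime,
        t.TransactionTypeCode,
        t.Amount,
        RunningBalance = (
            SELECT SUM( CASE r.TransactionTypeCode
                        WHEN 'D' THEN  r.Amount
                        ELSE          -r.Amount
                        END )
            FROM   AccountTransaction r
            WHERE  r.AccountNo  = t.AccountNo
            AND    r.DateTime  <= t.DateTime
            )
FROM    AccountTransaction t
WHERE   t.AccountNo = @AccountNo
ORDER BY t.DateTime;
If the listing does not start at zero, add the relevant AccountStatement.ClosingBalance as the opening figure.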
Comments
Alex:
yes code would be nice to look at, thank you. Even maybe a sample "bucket shop" so people could see the starting schema once and forever, would make world much better.
For the data model above.
Code • Report Current Balance
SELECT  AccountNo,
        ClosingDate    = DATEADD( DD, -1, Date ),   -- show last day of previous month
        ClosingBalance,
        CurrentBalance = ClosingBalance + (
            SELECT SUM( Amount )
                FROM   AccountTransaction
                WHERE  AccountNo = #AccountNo
                AND    TransactionTypeCode IN ( "A", "D" )
                AND    DateTime >= CONVERT( CHAR(6), GETDATE(), 2 ) + "01"
            ) - (
            SELECT SUM( Amount )
                FROM   AccountTransaction
                WHERE  AccountNo = #AccountNo
                AND    TransactionTypeCode NOT IN ( "A", "D" )
                AND    DateTime >= CONVERT( CHAR(6), GETDATE(), 2 ) + "01"
            )
    FROM    AccountStatement
    WHERE   AccountNo = #AccountNo
    AND     Date      = CONVERT( CHAR(6), GETDATE(), 2 ) + "01"
By denormalising that transactions log I trade normal form for more convenient queries and less changes in views/materialised views when I add more tx types
God help me.
When you go against Standards, you place yourself in a third-world position, where things that are not supposed to break, that never break in first-world countries, break.
It is probably not a good idea to seek the right answer from an authority, and then argue against it, or argue for your sub-standard method.
Denormalising (here) causes an Update Anomaly, the duplicated column, that can be derived from TransactionTypeCode. You want ease of coding, but you are willing to code it in two places, rather than one. That is exactly the kind of code that is prone to errors.
A database that is fully Normalised according to Dr E F Codd's Relational Model provides for the easiest, the most logical, straight-forward code. (In my work, I contractually guarantee every report can be serviced by a single SELECT.)
ENUM is not SQL. (The freeware NONsql suites have no SQL compliance, but they do have extras which are not required in SQL.) If ever your app graduates to a commercial SQL platform, you will have to re-write all those ENUMs as ordinary LookUp tables, with a CHAR(1) or an INT as the PK. Then you will appreciate that it is actually a table with a PK.
An error has a value of zero (it also has negative consequences). A truth has a value of one. I would not trade a one for a zero. Therefore it is not a trade-off. It is just your development decision.
This is fairly subjective. The things I'd suggest taking into account are:
How many accounts are there, currently?
How many accounts do you expect to have, in the future?
How much value do you place upon scalability?
How difficult is it to update your database and code to track the balance as its own field?
Are there more immediate development concerns that must be attended to?
In terms of the merits of the two approaches proposed, summing the transaction values on-demand is likely to be the easier/quicker to implement approach.
However, it won't scale as well as maintaining the current account balance as a field in the database and updating it as you go. And it increases your overall transaction processing time somewhat, as each transaction needs to run a query to compute the current account balance before it can proceed. In practice those may be small concerns unless you have a very large number of accounts/transactions or expect to in the very near future.
The downside of the second approach is that it's probably going to take more development time/effort to set up initially, and may require that you give some thought to how you synchronize transactions within an account to ensure that each one sees and updates the balance accurately at all times.
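As a rough sketch of what that synchronization looks like with the second approach (T-SQL-flavoured; table and column names are illustrative), the balance update and the transaction insert are made one atomic unit of work:
BEGIN TRANSACTION;

-- Apply the movement to the stored balance; the row lock taken here
-- serialises concurrent updates against the same account.
UPDATE Account
SET    Balance   = Balance + @SignedAmount   -- negative for outflows, per your convention
WHERE  AccountNo = @AccountNo;

-- Record the transaction in the same unit of work, so the stored balance
-- and the transaction log can never disagree.
INSERT INTO AccountTransaction ( AccountNo, TransactionDateTime, Amount )
VALUES ( @AccountNo, GETDATE(), @SignedAmount );

COMMIT TRANSACTION;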
So it mostly comes down to what the project's needs are, where development time is best spent at the moment, and whether it's worth future-proofing the solution now as opposed to implementing the second approach later on, when performance and scalability become real, rather than theoretical, problems.
tl;dr general question about handling database data and design:
Is it ever acceptable/are there any downsides to derive data from other data at some point in time, and then store that derived data into a separate table in order to keep a history of values at that certain time, OR, should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record peoples' vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
We could do this two ways I see:
A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time, and inserts them into a table like "YearEndHistories", which would have your EmployeeID, Year, and DaysRemaining at that time.
We don't keep a YearEndHistories table, but instead if we want to find out the amount of days possessed at a certain time, we loop through all vacations added and subtracted that exist UP TO that specific time.
I like the feeling of certainty that comes with #1 --- the recorded values would be reviewed by administration, and there would be no arguing or possibility about that number changing. With #2, I like the efficiency --- one less table to maintain, and there's no derived data present in the actual tables. But I have a weird fear about some unseen bug slipping by and people's historical value calculations starting to get screwed up or something. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to the previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by rewarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or doesn't change very often (eg: A high score table in a game - it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation, the calculation seems relatively simple - a sum of employee vacation days granted - days taken, but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" when data is concerned - try to think about the data as a whole, as a set. Something like
SELECT StaffID, sum(Vacation)
from (
      SELECT StaffID, Sum(VacationAllocated) as Vacation
      from   Allocations
      where  AllocationDate <= convert(datetime, '2010-12-31', 120)
      group by StaffID
      union
      SELECT StaffID, -Count(distinct HolidayDate)
      from   HolidayTaken
      where  HolidayDate <= convert(datetime, '2010-12-31', 120)
      group by StaffID
     ) totals
group by StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you a better "auditability", without performance penalty.
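A sketch of what that #1 capture might look like (YearEndHistories and its columns are from the question; the Employee table and its DaysRemaining column are assumed -- substitute the derived figure from the query above if the current value is not stored anywhere):
-- Run once at year end (eg. by the SQL Server Agent job):
-- snapshot each employee's remaining days into YearEndHistories.
INSERT INTO YearEndHistories ( EmployeeID, Year, DaysRemaining )
SELECT  e.EmployeeID,
        2010,                -- the year being closed
        e.DaysRemaining
FROM    Employee e;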
Is it better to keep day of month, month, year, day of week and week of year as separate reference tables or in a common Answer table? The goal is to allow user content searches and action analytics to be filtered by all the various date-time values (there will be custom reporting for users based on their shared content). I am trying to ensure data accuracy by using IDs, and also to report on numbers of shares, etc. by time and date for system reporting, comparing various user groups. If we keep these in separate tables, what about time? Would a table with each hour, minute and second also be needed?
Most databases support some sort of TIMESTAMP data type plus associated DAY(), MONTH(), DAYOFWEEK() functions.
The only valid reason for separate DAY or HOUR columns in a separate table is if you have precomputed totals and averages for each timeslot.
Even then it's only worth it if you expect a lot of filtering based on these values, as the cost of building these tables is high, and, for most queries, the standard SQL "GROUP BY ... HAVING ..." will perform well enough.
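For example, most of the filtering and roll-ups described can be written straight off the timestamp column, with no separate date tables (function names vary a little by platform; ContentShare and SharedOn are illustrative names):
-- Shares per month and day-of-week for one year, grouped directly
-- on expressions over the timestamp column.
SELECT   YEAR( SharedOn )      AS ShareYear,
         MONTH( SharedOn )     AS ShareMonth,
         DAYOFWEEK( SharedOn ) AS ShareDayOfWeek,
         COUNT(*)              AS ShareCount
FROM     ContentShare
WHERE    SharedOn >= '2021-01-01'
AND      SharedOn <  '2022-01-01'
GROUP BY YEAR( SharedOn ), MONTH( SharedOn ), DAYOFWEEK( SharedOn )
ORDER BY ShareYear, ShareMonth, ShareDayOfWeek;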
It sounds like you may be interested in a "star schema" (see Wikipedia), a common method in data warehousing to speed up queries -- but be warned, designing and building a star schema is not a trivial exercise.