I store readings in a database from sensors for a Temperature monitoring system.
There's 2 types of reading: air and product. The product temperature is represents the slow temperature change of an item of food versus the actual air temperature.
They 2 temperatures are taken from different sensors (different locations within the environment, usually a large controlled environment) so they are not related (i.e. I cannot derive the product temperature from the air temperature).
Initially the product temperature I was provided with was already damped by the sensor, however whoever wrote the firmware made a mistake so the damped value is incorrect, and now I instead have to take the un-damped reading from the product sensor and apply the damping myself based on the last few readings in the database.
When a new reading comes in, I look at the last few undamped readings, and the last damped reading, and determine a new damped reading from that.
My question is: Should I store this calculated reading as well as the undamped reading, or should I calculate it in a view leaving all physically stored readings undamped?
One thing that might influence this: The readings are critical; alarms rows are generated against the readings when they go out of tolerance: it is to prevent food poisoning and people can lose there jobs over it. People sign off the values they see, so those values must never change..
Normally I would use a view and put the calculation in the view, but I'm a little nervous about doing that this time. If the calculation gets "tweaked" I then have to make the view more complicated to use the old calculation before a certain timestamp, etc. (which is fine; I just have to be careful wherever I query the reading values - I don't like nesting views in other views as sometimes it can slow the query..).
What would you do in this case?
Thanks!
The underlying idea from the relational model is "logical data independence". Among other things, SQL views implement logical data independence.
So you can start by putting the calculation in a view. Later, when it becomes too complex to maintain that way, you can move the calculation to a SQL function or SQL stored procedure, or you can move the calculation to application code. You can store the results in a base table if you want to. Then update the view definition.
The view's clients should continue to work as if nothing had changed.
Here's one problem with storing this calculated value in a base table: you probably can't write a CHECK constraint to guarantee it was calculated correctly. This is a problem regardless of whether you display the value in a view. That means you might need some kind of administrative procedure to periodically validate the data.
Related
I want to keep track of a sum of user-controlled variables.
Each user can add/remove/update his/her own variables.
Users should be able to see the sum change after their own update.
As the number of users scales up, the system will be distributed and updates will happen concurrently.
I want to avoid a bottleneck for updating the sum.
What is the best way to keep track of this sum?
Is a exist database that can handle this or do I need implement something myself?
So generally, if you know by how much each value has changed you know how the sum has changed and you can use these incremental changes to update the sum.
In a centralised was you could for instance use any SQL Database which supports triggers and transactions. You'd have a table with all the description of different clients / numbers and their values and another table to cache the sum. The idea is that the trigger would run on update / delete / insert und just update the cached sum. This way it would be much faster for huge amounts of data, but also much more error prone (you can also just re-sum up all the values in the trigger and store it in a cache, and it would work for few thousand values easily)
In a decentralised system you can do something similar. Here you can either share all the values between all the clients, or (as this might be too much) just the change. So every client is responsible for some values and on every change he'd share that the total sum has changed by the change. - example: if the user modifies a value from 5 to 3 the client will broadcast -2. you assume an initial state of 0 and just sum up all the numbers from the clients as they come in. The order doesn't matter due to the commutative property of the addition operation. You only need to make sure that everyone will receive the data, but you can achieve this via reliable multicast.
Which one is best, regarding the implementation of a database for a web application: a lean and very small database with only the bare information, sided with a application that "recalculates" all the secondary information, on demand, based on those basic ones, OR, a database filled with all those secondary information already previously calculated, but possibly outdated?
Obviously, there is a trade-of there and I think that anyone would say that the best answer to this question is: "depends" or "is a mix between the two". But I'm really not to comfortable or experienced enough to reason alone about this subject. Could someone share some thoughts?
Also, another different question:
Should a database be the "snapshot" of a particular moment in time or should a database accumulate all the information from previous time, allowing the retrace of what happened? For instance, let's say that I'm modeling a Bank Account. Should I only keep the one's balance on that day, or should I keep all the one's transactions, and from those transactions infer the balance?
Any pointer on this kind of stuff that is, somehow, more deep in database design?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDMSes are extremely good at handling vast amounts of data, so when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, the processing time doesn't increase at the same rate as the size of data does - the more data ~ the more users. This would generally mean that processing times would multiply by the data's size and the number of users.
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out of date pre-calulated values. That's what trigger are for (among other things). However for most applications, I would not start precalculating until you need to. It may be that the calculation speed is always there. Now in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalulation process bases on triggers that adjust the values every time they are changed.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, you have an order, do not rely onthe customer address table or the product table to get data about where the prodcts were shipped to or what they cost at the time of the order. This data changes over time and then you orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR application are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming(not calculate anything twise). If you're not limited with space and are fine with a bit outdated data, then precalculate it and store in the DB. This will give you additional benefit of being able to run sanity checks and ensure that data is always consistent.
But as others already replied, it depends :)
I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore 25.000 users which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and formost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straight forward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for chacheing, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.
I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:
Data and time of the sequence start
Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
The sequence of samples, each of exactly 4 bytes
My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals I mean that for example I need to show the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database:
Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.
I wondered if there are best practices for such cases.
Many thanks.
Storing each sample in it's own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.
I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient as well.
Since your main goal appears to be to display the information in interesting and flexible ways I'd go with separate rows for each data point. I presume most of the effort required to write this program well is likely on the display side, you should minimize the complexity on that side as much as possible.
Storing data in BLOBs is good if the content isn't relevent and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore, very relevent.
I think you should:
1.Store every sample in a row of its own: when I receive a signal, break it to samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.
As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
This might also be "good enough" when querying the data. However, if performance is an issue, and/or if you get massive amounts of volume [100,000 sensors x 1 per 15 minutes x 4 hours = 9,600,000 rows per day, x 5 years = 17,529,600,000 or so rows in five years]. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse ever day/week/month/whatever, depends on time, effort, available resources, and so on.
A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.
Let's say I'm getting a large (2 million rows?) amount of data that's supposed to be static and unchanging. Supposed to be. And this data gets republished monthly. What methods are available to 1) be aware of what data points have changed from month to month and 2) consume the data given a point in time?
Solution 1) Naively save every snapshot of data, annotated by date. Diff awareness is handled by some in-house program, but consumption of the data by date is trivial. Cons, space requirements balloon by an order of magnitude.
Solution 2A) Using an in-house program, track when the diffs happen and store them in an EAV table, annotated by date. Space requirements are low, but consumption integrated with the original data becomes unwieldly.
Solution 2B) Using an in-house program, track when the diffs happen and store them in a sparsely filled table that looks much like the original table, filled only with the data that's changed and the date when changed. Cons, model is sparse and consumption integrated with the original data is non-trivial.
I guess, basically, how do I integrate the dimension of time into a relational database, keeping in mind both the viewing of the data and awareness of differences between time periods?
Does this relate to data warehousing at all?
Smells like... Slowly changing dimension?
I had a similar problem - big flat files imported to the database once per day. Most of the data is unchanging.
Add two extra columns to the table, starting_date and ending_date. The default value for ending_date should be sometime in the future.
To compare one file to the next, sort them both by the key columns, then read one row from each file.
If the keys are equal: compare the rest of the columns to see if the data has changed. If the row data is equal, the row is already in the database and there's nothing to do; if it's different, update the existing row in the database with an ending_date of today and insert a new row with a starting_date of today. Read a new row from both files.
If the key from the old file is smaller: the row was deleted. Update ending_date to today. Read a new row from the old file.
If the key from the new file is smaller: a row was inserted. Insert the row into the database with a starting_date of today. Read a new row from the new file.
Repeat until you've read everything from both files.
Now to query for the rows that were valid at any date, just select with a where clause test_date between start_date and end_date.
You could also take a leaf from the datawarehousing book. There are basically three ways of of dealing with changing data.
Have a look at this wikipedia article for SCD's but it is in essence tables:
http://en.wikipedia.org/wiki/Slowly_changing_dimension
A lot of this depends on how you're storing the data. There are two factors to consider:
How oftne does the data change?
How much does the data change?
The distinction is important. If it changes often but not much then annotated snapshots are going to be extremely inefficient. If it changes infrequently but a lot then they're a better solution.
It also depends on if you need to see what the data looked like at a specific point in time.
If you're using Oracle, for example, you can use flashback queries to see a consistent view of the data at some arbitrary point.
Personally I think you're better off storing it incrementally and, at a minimum, using some form of auditing to track changes so you can recover an historic snapshot if it's ever required. But like I said, this depends on many factors.
If it was me, I'd save the whole thing every month (not necessarily in a database, but as a data file or text file off-line) - you will be glad you did. Even at a row size of 4096 bytes (wild ass guess), you are only talking about 8G of disk per month. You can save a LOT of months on a 300G drive. I did something similar for years, when I was getting over 1G per day in downloads to a datawarehouse.
This sounds to me rather like the problem faced by source code version control systems. These store patches which are used to create the changes as they occur. So if a file does not change, or only a few lines change, the patch that needs to be stored is relatively very small. The system also stores which version each patch contributes to. So, when viewing a particular version of a particular file, the initial version is recovered and all the patches, up to the version requested are applied.
In your, very general, situation, you need to divide up your data into chunks. Hopefully there are natural divisions you can use, but if this division has to be arbitrary that's should be OK. Whenever a change occurs, store the patch for the affected chunk and record a new version. Now, when you want to view a particular date, find the last version that predates the view date, apply the patches for the chunk that has been requested, and display.
Could you do the following:
1. Each month BCP all data into a temporary table
2. Run a script or stored procedure to update the primary table
(which has an additional DateTime column as part of a composite key),
with any changes made.
3. Repeat each month.
This should give you a table, which you can query data for at a particular date.
In addition each change will be logged, and the size of the table shouldn't change dramatically over time.
However, as a backup to this, I would store each data file as Brennan suggests.