What is the life span of data?

Recently I've found myself in a database tangle where management wants the ability to remove data from the database, but still wants that data to appear in other places. Example: They want to remove all instances of the product whizbang, but they still want whizbang to appear in sales reports if they run one for a previous date.
Now I can add a field, say is_deleted, that tracks whether the product has been deleted and thus keeps all my references intact, but over time I have the potential of housing a lot of dead data (data that is never accessed again). How to handle this is not my question.
I'm curious to find out, in your experience, what the average life span of data is. That is, on average, how long is data alive or good for before it gets either replaced or deleted? I understand that this is relative to the type of data you are housing, but certainly all data has some sort of life span?

Data lives forever... or often it should. One common practice is to have end and/or start dates for a record. So for your whizbang, you have a start date (so that it won't appear on sales reports before its official launch) and an end date (so that it drops off reports after it's been end-of-lifed). Using the proper dates as criteria for your reporting as well as your applications, you won't see the whizbang except when you should, and the data still exists (which it should, theoretically, indefinitely).
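As a rough sketch of that idea (the table and column names here are hypothetical, not from the question):

CREATE TABLE products (
    product_id  INTEGER PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    launch_date DATE NOT NULL,
    end_date    DATE              -- NULL while the product is still active
);

-- "Removing" whizbang just sets its end date
UPDATE products SET end_date = '2009-06-30' WHERE name = 'whizbang';

-- A sales report for a past period still includes it, because the report
-- filters on the reporting date range rather than on a deleted flag
SELECT p.name, SUM(s.quantity) AS units_sold
FROM sales s
JOIN products p ON p.product_id = s.product_id
WHERE s.sale_date BETWEEN '2009-01-01' AND '2009-03-31'
GROUP BY p.name;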
As Koistya Navin mentions, moving data to a data warehouse at a certain point is also an option, but this depends in large part on how large your 'old' data is, and how long you need to keep it readily available for access.

Many of our customers keep data online for 2 years. After that it's moved to backup disks, but it can be put online if needed.
Consider adding a column such as "expiration" or "effective date". This will allow you to mark a product as obsolete, but reports will still return that product if the time range is satisfied.

Usually it's better to move such data into a separate database (a data warehouse) and keep the working database clean. In a data warehouse your data can be kept for many years without impacting your application.
Reference: Data Warehouse at Wikipedia

I've always gone by what the governing body is looking for. For example, the IRS wants you to keep 7 years of history, or for security reasons we keep 3 years of log information, etc. So I guess you could do two things: determine what the life span of your data is (I would say 3 years would be enough), and then add the is_deleted flag along with a date, so that you are able to flag some data to be deleted sooner rather than later.

Yes, all data has a lifespan. And yes, it is relative to the type of data you have.
Some data has a lifespan measured in seconds (authentication tokens, for instance); other data lasts a virtual eternity, outliving the media and formats it is stored in (ownership records, for instance).
You will have to either be more specific as to the type of data you are envisioning, or do a census in your own organization as to the usual lifespan of stuff.

Our particular flavor varies. We have some data (a vast majority) which goes stale after 3 months (hard product limit) but can be revived at any later date.
We have other data that is effectively immortal.
In practice, most of the data we serve up is fresh and frequently requested for a few weeks, at most a month, before falling to sporadic use.

How much is "a lot of dead data"?
With processing power and data storage so cheap, I wouldn't purge old data unless there's a really good reason to. You also need to consider the legal implications. Large (and even small) companies may have incredibly long retention policies for old data, to save themselves millions down the road when they are subpoenaed for it by a judge.
I would check with whatever legal department you have and find out how long the data needs to be stored. That's the safest bet.
Also, ask yourself what the benefit of removing the old data is. Is the only benefit a tidier database? If so, I wouldn't do it. Are you going to see a 10X performance increase? If so, I'd do it. This really is a complex question though, and it's tough for us to have all the information required to give you good advice.

I have a few projects where the customer wants all the historical data (going back over 19 years). Quite a bit of the really old data is malformed and is going to be a nightmare to import into the new system. We convinced them that they won't need records going back any further than 10 years, but like you said it's all relative to the type of data you're housing.
On a side note, data storage is extremely cheap right now, and if it isn't affecting the performance of your application, I would just leave it where it is.

[...] but certainly all data has some sort of life span?
Not any kind of life span we can talk about meaningfully. A lot of data is useless as soon as it's created or recorded. Such data could be discarded immediately with no effect. On the other hand, some data has enough value that it will outlive the current system that hosts it. If Amazon were to completely replace their current infrastructure, the customer histories they have stored would still be immensely valuable.
As you said, it's relative. Each type of data has its own life span that has no relation to another type of data's life span. There's no meaningful "average life span of data".

I have the potential of housing a lot of dead data. (data that is never accessed again).
But it will be accessed: when they run those reports, they are accessing that data.
Until then you'll need to keep the data in some form. Move it to another table or use a flag like you mentioned.

uh...at the risk of oversimplifying...it sounds like using DateDeleted instead of a bit would solve your how-long-to-keep issue.
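For example, a minimal sketch of the DateDeleted idea (PostgreSQL-style syntax, hypothetical names), which also gives you a retention policy for free:

-- Replace the boolean flag with a date; NULL means "not deleted"
ALTER TABLE products ADD COLUMN deleted_on DATE;

-- Soft-delete as before
UPDATE products SET deleted_on = CURRENT_DATE WHERE name = 'whizbang';

-- Later, purge anything that has been soft-deleted for more than, say, 3 years
DELETE FROM products
WHERE deleted_on IS NOT NULL
  AND deleted_on < CURRENT_DATE - INTERVAL '3 years';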

Related

NOSQL denormalization datamodel

Many times I read that data in NoSQL databases is stored denormalized. For instance, consider a chess game record. It may contain not only the IDs of the players that participate in the chess game, but also each player's first and last name. I suppose this is done because joins are not possible in NoSQL, so if you just duplicate data you can still retrieve all the data you want in one call, without manual application-level processing of the data.
What I don't understand is that when you now want to update a chess player's name, you have to write a query that updates both the chess-game records in which that player participates and the player record of that player. This seems like a huge performance overhead, as the database will have to search all games that the player participates in and then update each of those records.
Is it true that data is often stored denormalized like in my example?
You are correct, the data is often stored de-normalized in NoSQL databases.
The problem with the updates is partially where the term "eventual consistency" comes from.
In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all other records. Yes, while the update is happening you may retrieve an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency tradeoff is acceptable.
You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views
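To make the fan-out concrete (written as SQL purely for illustration; in a document store it would be a batch of document updates, and the table/column names here are invented):

-- The frequent, cheap write: update the player record itself
UPDATE players SET first_name = 'Magnus', last_name = 'Carlsen'
WHERE player_id = 42;

-- The background job then pushes the change to every denormalized copy
UPDATE games SET white_first_name = 'Magnus', white_last_name = 'Carlsen'
WHERE white_player_id = 42;

UPDATE games SET black_first_name = 'Magnus', black_last_name = 'Carlsen'
WHERE black_player_id = 42;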
One way to look at it is that a user changing his/her name is an extremely rare event.
But the number of times that board data is read and changed is immense.
So it only makes sense to optimize for the case that happens so many more times than the case that happens only ever so rarely.
Another point to note is that by not keeping that name data duplicated under board data, you are actually increasing the performance overhead of the read. Every time you fetch the board data, you'd have to go one more step ahead and fetch all the user data too (even if all you really wanted was just first and last name).
Again the reason to put that first name and last name on board data is probably that on the screen where the board data will be shown, you'll often be showing the user's name too.
For these reasons, you are expected to keep duplicate data in NoSQL DBs. (This can be done in SQL DBs too, but mind you, you'll be frowned upon.) Duplication in the NoSQL world is fairly common and is actively promoted.
I have been working for the past 7 years with NoSQL (Firestore) for 2 fairly big projects where I was able to write code from scratch (both around 50k LoC and one has about 15k daily active users). I didn't use denormalization at all. The concept never appealed to me, and document reads are fairly cheap in Firestore.
To come back to your example; loading the other data for the chess game seems way more important than instantly being able to show the name. I would load the name based on the user id in the background and put a simple client-side memoize / cache around it to prevent fetching the same user document over and over.
What I did use quite a bit to solve performance issues is generate derived data. I would set a listener on a database document "onWrite" and then store some computed data in another derived document. These documents would automatically update when the source changes, so it doesn't complicate things really. In the case of a chess game, a distilled document could be the leaderboard that is constantly shown to all users of the app.
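The relational cousin of that derived-data idea is a materialized view or a trigger-maintained summary table; a rough PostgreSQL sketch with a made-up games table, just to show the shape of it:

-- Derived leaderboard, computed once and read cheaply by every client
CREATE MATERIALIZED VIEW leaderboard AS
SELECT winner_id AS player_id, COUNT(*) AS wins
FROM games
GROUP BY winner_id;

-- Refresh when the source data changes, or on a schedule
REFRESH MATERIALIZED VIEW leaderboard;

The Firestore version replaces the refresh with an onWrite listener that updates a derived document.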
Another optimization I had to do was to distill a long list of titles + metadata for recently opened "projects". Firestore on the web client side doesn't give you the ability to select fields from a document in a query; it only fetches full documents, and that was too much data for the list, so we solved it by making an API endpoint and fetching the distilled data through that.
I'm not saying you should follow my advice, but we seem to be doing well in terms of code complexity and database costs. So when I read that NoSQL requires data denormalization I become skeptical :)
That's my 2 cents.

Big Data Database

I am collecting a large amount of data which is most likely going to be a format as follows:
User 1: (a,o,x,y,z,t,h,u)
Where all the variables change dynamically with respect to time, except u - this is used to store the user name. What I am trying to understand, since my background is not very strong in "big data", is that the array I end up with will be very large, something like 108000 x 3500, since I will be performing analysis on each timestep and graphing it. What I am trying to determine is an appropriate database to manage this in. Since this is for scientific research I was looking at CDF and HDF5, and based on what I read here (NASA) I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read the data row by row to interpret it and make entries into the dataset. Maybe I should be looking at things like CouchDB or an RDBMS; I just don't know a good place to start. Advice would be appreciated.
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500 doesn't really qualify as big data these days, not unless you've omitted a unit such as GB. If it's 108000*3500 values at 8 bytes each, that's only about 3GB. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
But if you want further suggestions to consider, I suggest:
SciDB
Rasdaman, and
MonetDB
all of which have some traction in the academic big data community and are beginning to be used outside that community too.
I have been using CDF for some similarly sized data and I think it should work nicely. You will need to keep a few things in mind though. Considering I don't really know the details of your project, this may or may not be helpful...
3GB of data is right around the file size limit for the older version of CDF, so make sure you are using an up-to-date library.
While 3GB isn't that much data, depending on how you read and write it, things may be slow going. Make sure you use the hyper read/write functions whenever possible.
CDF supports meta-data (called global/variable attributes) that can hold information such as username and data descriptions.
It is easy to break data up into multiple files. I would recommend using one file per user. This will mean that you can write the user name just once for the whole file as an attribute, rather than in each record.
You will need to create an extra variable called epoch. This is a well-defined timestamp for each record. I am not sure if the timestamp you have now would be appropriate or if you will need to process it some, but it is something you need to think about. Also, the epoch variable needs to have a specific type assigned to it (epoch, epoch16, or TT2000). TT2000 is the most recent version, which gives nanosecond precision and handles leap seconds, but most CDF readers that I have run into don't handle it well yet. If you don't need that kind of precision, I recommend epoch16, as that has been the standard for a while.
Hope this helps, if you go with CDF, feel free to bug me with any issues you hit.

Collecting data which wasn't predicted when the system was designed

How do you go about collecting and storing data which was not part of the initial database and software design? For example, if you've come up with a points system, you have to collect the points for every user who has already registered. For new users that would be easy, because the updated business logic will reflect the points system... but what about the old ones?
In general, how does one deal with data, which should have been there from the beginning, but wasn't? Writing manual queries to collect the missing pieces? Using crons?
Well, you are asking for something that is by definition not possible, I think.
deal with data, which should have been there from the beginning, but wasn't?
If you are able to deduce the number of points from the existing data in the database, then there is obviously no missing data; storing the points separately would merely make them redundant (still a fine option in case you need that for performance).
For example: stackoverflow rewards number of consecutive visits. Let's say they did not do that from the start. If they were logging date-of-visit already, you can recalc the points. So no missing data.
So if that is not possible, you need another solution: either get data from other sources (parse a webserver log) or get the business to draft some extra business rules for the determination of the default values for the existing users (difficult in this particular example).
Writing manual queries to collect the missing pieces? Using crons?
I would populate that in a conversion script or even in a special conversion application if very complex.
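For example, sticking with the consecutive-visits case, a one-off conversion script might backfill the missing points from data you do have (PostgreSQL-style UPDATE ... FROM; the tables, columns, and the 10-points-per-visit rule are all made up):

-- Business-approved default for everyone first
UPDATE users SET points = 0 WHERE points IS NULL;

-- Then credit users for visits that were already being logged
UPDATE users u
SET points = v.visit_count * 10
FROM (
    SELECT user_id, COUNT(DISTINCT visit_date) AS visit_count
    FROM visit_log
    GROUP BY user_id
) v
WHERE v.user_id = u.user_id;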

Designing a database with periodic sensor data

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
The basic structure of the data coming in is as follows:
For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.
The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.
Now, to describe the data volume and common access patterns:
Data will be coming in from about 5 loggers, each with 48 channels, every minute.
The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.
I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
I have considered quite a few schema options, the most simple resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:
RecordTable              DataValueTable
-----------              --------------
[PK] id            <--   [FK] record_id
[FK] logger_id           [FK] channel_id
record_number            value
logger_time
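For what it's worth, a rough PostgreSQL rendering of that layout, plus the cross-logger access pattern described above (all names are illustrative; logger and channel tables are assumed to exist elsewhere):

CREATE TABLE record_table (
    id            BIGSERIAL PRIMARY KEY,
    logger_id     INTEGER   NOT NULL REFERENCES logger(id),
    record_number INTEGER   NOT NULL,
    logger_time   TIMESTAMP NOT NULL,
    UNIQUE (logger_id, record_number, logger_time)
);

CREATE TABLE data_value_table (
    record_id  BIGINT  NOT NULL REFERENCES record_table(id),
    channel_id INTEGER NOT NULL REFERENCES channel(id),
    value      REAL,
    PRIMARY KEY (record_id, channel_id)
);

-- "channel1 from logger1, channel4 from logger2, full outer join on time"
SELECT COALESCE(a.logger_time, b.logger_time) AS t,
       a.value AS logger1_ch1,
       b.value AS logger2_ch4
FROM (SELECT r.logger_time, d.value
      FROM record_table r JOIN data_value_table d ON d.record_id = r.id
      WHERE r.logger_id = 1 AND d.channel_id = 1) a
FULL OUTER JOIN
     (SELECT r.logger_time, d.value
      FROM record_table r JOIN data_value_table d ON d.record_id = r.id
      WHERE r.logger_id = 2 AND d.channel_id = 4) b
ON a.logger_time = b.logger_time;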
Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.
At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).
I also considered using the above with partial indexes on channel_id since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some express concern with vacuuming and analyzing a large table, and dealing with a large number of indexes (up to 250 in this case).
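For reference, the partial-index idea looks like this in PostgreSQL (one small index per channel; whether it stays manageable at ~250 channels is exactly the open question):

CREATE INDEX data_value_ch1_idx ON data_value_table (record_id) WHERE channel_id = 1;
CREATE INDEX data_value_ch2_idx ON data_value_table (record_id) WHERE channel_id = 2;
-- ...and so on, created by the same function that registers a new channel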
On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.
Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.
So, in conclusion, I guess these are my primary questions for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing in the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?
I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.
It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.
Second, when you start off with surrogate keys, implying that you know the data, and how it relates to other data, it prevents you from genuine modelling of the data.
I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.
Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical
I did something like this with seismic data for a petroleum exploration company.
My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.
You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data. You should make a copy of the sensor data with the modifications so that you can show later what changes were made to the sensor data.
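A minimal sketch of what that metadata catalogue might look like (all names invented; the raw samples themselves never live in the database):

CREATE TABLE sensor_file (
    file_id      SERIAL PRIMARY KEY,
    logger_id    INTEGER   NOT NULL,
    start_time   TIMESTAMP NOT NULL,
    end_time     TIMESTAMP NOT NULL,
    file_path    TEXT      NOT NULL,  -- location of the flat file on disk
    derived_from INTEGER REFERENCES sensor_file(file_id),  -- NULL for the original capture
    notes        TEXT                 -- what was changed in this copy, and why
);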

Inventory database design [closed]

This is a question not really about "programming" (it is not specific to any language or database), but more about design and architecture. It's also a question of the type "What is the best way to do X". I hope it doesn't cause too much "religious" controversy.
In the past I have developed systems that, in one way or another, keep some form of inventory of items (it's not relevant what the items are), some using languages/DBs that do not support transactions. In those cases I opted not to save the item's quantity on hand in a field in the item record. Instead, the quantity on hand is calculated as total inventory received minus total inventory sold. This has resulted in almost no inventory discrepancies caused by software. The tables are properly indexed and the performance is good. There is an archiving process in case the number of records starts to affect performance.
Now, a few years ago I started working at this company, and I inherited a system that tracks inventory. But the quantity is saved in a field. When an entry is registered, the quantity received is added to the quantity field for the item. When an item is sold, the quantity is subtracted. This has resulted in discrepancies. In my opinion this is not the right approach, but the previous programmers here swear by it.
I would like to know if there is a consensus on the right way to design such a system. Also, what resources are available, printed or online, to seek guidance on this.
Thanks
I have seen both approaches at my current company and would definitely lean towards the first (calculating totals based on stock transactions).
If you are only storing a total quantity in a field somewhere, you have no idea how you arrived at that number. There is no transactional history and you can end up with problems.
The last system I wrote tracks stock by storing each transaction as a record with a positive or negative quantity. I have found it works very well.
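A sketch of that approach (hypothetical names), where the on-hand quantity is always derived from the transaction ledger:

CREATE TABLE stock_transaction (
    transaction_id SERIAL PRIMARY KEY,
    item_id        INTEGER     NOT NULL,
    quantity       INTEGER     NOT NULL,  -- positive for receipts/returns, negative for sales/write-offs
    reason         VARCHAR(20) NOT NULL,  -- 'receipt', 'sale', 'adjustment', ...
    created_at     TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Quantity on hand for an item is just the sum of its transactions
SELECT COALESCE(SUM(quantity), 0) AS on_hand
FROM stock_transaction
WHERE item_id = 123;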
The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises
The Data Model Resource Book, Vol. 2: A Library of Data Models for Specific Industries
The Data Model Resource Book: Universal Patterns for Data Modeling
I have Vol 1 and Vol 2 and these have been pretty helpful in the past.
It depends; inventory systems are about far more than just counting items. For example, for accounting purposes you might need to know the accounting value of inventory based on the FIFO (first-in-first-out) model. That can't be calculated by a simple "total inventory received minus total inventory sold" formula, but their model might calculate it easily, because they modify the accounting value as they go. I don't want to go into details because this is not a programming issue, but if they swear by it, maybe you haven't fully understood all the requirements they have to accommodate.
Both are valid, depending on the circumstances. The former is best when the following conditions hold:
the number of items to sum is relatively small
there are few or no exceptional cases to consider (returns, adjustments, et al)
the inventory item quantity is not needed very often
On the other hand, if you have a large number of items, several exceptional cases, and frequent access, it will be more efficient to maintain the item quantity.
Also note that if your system has discrepancies then it has bugs, which should be tracked down and eliminated.
I have done systems both ways, and both ways can work just fine - as long as you don't ignore the bugs!
It's important to consider the existing system and the cost and risk of changing it. I work with a database that stores inventory kind of like yours does, but it includes audit cycles and stores adjustments just like receipts. It seems to work well, but everyone involved is well trained, and the warehouse staff aren't exactly quick to learn new procedures.
In your case, if you're looking for a little more tracking without changing the whole DB structure, then I'd suggest adding a tracking table (kind of like the one from your "transaction" solution) and then logging changes to the inventory level. It shouldn't be too hard to update most changes to the inventory level so that they also leave a transaction record. You could also add a periodic task to back up the inventory level to the transaction table every couple of hours or so, so that even if you miss a transaction you can discover when the change happened or roll back to a previous state.
If you want to see how a large application does it, take a look at SugarCRM; they have an inventory management module, though I'm not sure how it stores the data.
I think this is actually a general best-practices question about doing a (relatively) expensive count every time you need a total vs. doing that count every time something changes, then storing the count in a field and reading that field whenever you need a total.
If I couldn't use transactions, I would go with the live count every time I needed a total. If transactions are available, it would be safe to perform the inventory update operations and the saving of the re-counted total within the same transaction, which would ensure the accuracy of the count (although I'm not sure this would work with multiple users hitting the database).
But if performance is not really a huge problem (and modern databases are good enough at counting rows that I would rarely even worry about this) I'd just stick with the live count each time.
I would opt for the first way, where "the quantity on hand is calculated totaling inventory received - total of inventory sold". The Right Way, IMO.
EDIT: I would also want to factor any stock losses/damages into the system, but I'm sure you have that covered.
I've worked on systems that solve this problem before. I think the ideal solution is a precomputed column, which gets you the best of both worlds: your total would be a field somewhere, thus no expensive lookups, but it can't get out of sync with the rest of your data (the database maintains the integrity). I don't remember which RDBMSs support precomputed columns, but if you don't have transactions, that might not be available either.
You could potentially fake precomputed columns (very effectively... I see no downside) using triggers. You'd probably need transactions though. IMHO, keeping data integrity when you're doing this sort of controlled denormalization is the only legitimate use for a trigger.
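A sketch of the trigger variant (PostgreSQL flavour, hypothetical tables), keeping a stored items.quantity_on_hand in step with the transaction table; a complete version would also cover updates and deletes of transactions:

CREATE OR REPLACE FUNCTION maintain_on_hand() RETURNS trigger AS $$
BEGIN
    UPDATE items
    SET quantity_on_hand = quantity_on_hand + NEW.quantity
    WHERE item_id = NEW.item_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stock_transaction_on_hand
AFTER INSERT ON stock_transaction
FOR EACH ROW EXECUTE FUNCTION maintain_on_hand();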
Django-inventory is geared more toward fixed assets, but it might give you some ideas.
i.e. ItemTemplate (class) -> ItemsOnHand (instance)
An ItemOnHand can be linked to more ItemTemplates; for example, a printer and the ink cartridges it requires. This also allows you to set reorder points for each ItemOnHand.
Each ItemOnHand is linked to InventoryTransactions; this allows for easy auditing.
To avoid calculating the actual on-hand items from thousands of inventory transactions, checkpoints are used, which are just a balance plus a date. To calculate items on hand, query for the most recent checkpoint and start adding or subtracting items to find the current balance. Define new checkpoints periodically.
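In SQL terms (hypothetical tables), something along these lines:

-- Current on-hand = the most recent checkpoint balance
-- plus all transactions recorded since that checkpoint
SELECT c.balance + COALESCE(SUM(t.quantity), 0) AS on_hand
FROM inventory_checkpoint c
LEFT JOIN inventory_transaction t
       ON t.item_id = c.item_id
      AND t.created_at > c.checkpoint_date
WHERE c.item_id = 123
  AND c.checkpoint_date = (SELECT MAX(checkpoint_date)
                           FROM inventory_checkpoint
                           WHERE item_id = 123)
GROUP BY c.balance;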
I can see some benefit to having the two columns, but I'm not following the part about discrepancies - you seem to be implying that having the two columns (in and out) is less prone to discrepancy than a single column (current). Why is that?
It's not about having one or two columns; what I meant by "totaling inventory received - total of inventory sold" is something like this:
SELECT SUM(quantity) AS inventory_received FROM Inventory_entry
SELECT SUM(quantity) AS inventory_sold FROM Sales_items
then
Quantity_on_hand = inventory_received - inventory_sold
Please keep in mind that I oversimplified this in my initial explanation. I know there is much more to inventory than just keeping track of quantities, but in this case that's where the problem lies and what we want to fix. At this point the reason to change it is precisely the cost of supporting the problems caused by the current design.
Also, I wanted to mention that although this is not a "coding" question, it is related to algorithms and design, which IMHO are very important topics.
Thanks everybody for your answers so far.
Nelson Marmol
We solve different problems, but our approach to some of them might be interesting to you.
We allow the system to make a "best guess", and give the users regular feedback about any of those guesses that look wrong.
To apply this to inventory, you could have 3 fields:
inventory_received
inventory_sold
estimated_on_hand
Then, you could run a process (daily?) along the lines of:
SELECT *
FROM Inventory
WHERE estimated_on_hand != inventory_received - inventory_sold
Of course, this relies on users looking at this alert, and doing something about it.
Also, you could have a function to reset inventory somehow, either by updating inventory_sold/received, or perhaps by adding another field, "inventory_adjustment", which could be positive or negative.
... just some thoughts. Hope it's helpful.
