History table tree - database

I have represented a tree in a relational table. Each node is a record whose values can change, and I can also insert, delete, or group nodes.
What is the best approach to building a history table of all these changes?
Is there a database design pattern available for this?

This isn't really a design pattern question so much as a DB design question.
Important question:
1. How secure do you need this historical record to be? Is it safe to store in the DB itself, or are you concerned that the DB might be compromised and you need a tamper-proof record?
Assuming you can store this data in the DB, there isn't much to design:
Create a trigger on whatever meta-table your DB uses to record new tables and have IT create a new trigger on the table just created. This way, it's maintenance-free.
Gather the following information in a flat table: who, what, when
When you create your historical table, make sure you create triggers preventing any updates or deletes on it. The table should be append-only!
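A minimal sketch of what that flat, append-only table could look like, assuming SQL Server syntax (table, column, and trigger names are purely illustrative):

-- Flat audit table: who, what, when
CREATE TABLE audit_log (
    audit_id    BIGINT IDENTITY(1,1) PRIMARY KEY,
    table_name  SYSNAME       NOT NULL,                          -- what was touched
    operation   CHAR(1)       NOT NULL,                          -- 'I', 'U' or 'D'
    changed_by  SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),    -- who
    changed_at  DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(), -- when
    row_data    NVARCHAR(MAX) NULL                               -- snapshot of the affected row
);
GO
-- Block updates and deletes so the table stays append-only
CREATE TRIGGER trg_audit_log_append_only
ON audit_log
INSTEAD OF UPDATE, DELETE
AS
BEGIN
    RAISERROR('audit_log is append-only', 16, 1);
END;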

Thank you for your answer.
About your security question: I don't have any restrictions, so I can store historical data in the DB.
Assuming that historical data can be modeled in a relational database using techniques such as separate tables for history records and transaction logs, I have found:
Row-based auditing - This technique creates a separate table for each relational table to maintain historical data. The auditing table contains every column of the operational table, plus a start time and an end time to track the lifespan of the data, and two additional attributes: operation type and username.
Column-based auditing - Column-based auditing removes the redundancy of row-based auditing. The auditing table stores only the changed values in its history columns, plus the primary key (such as an ID), which is used to reference the operational table.
Log-table auditing - Log tables have long been used for transaction management in relational databases. Since a transaction needs to record the operation, the data, and the time of execution, log tables can be used for auditing purposes too.
For my problem I am guessing that row-based auditing may be a good approach, because it also tracks the operation, and if I add attributes like root/leaf (or a parent reference) I can store the structure of the tree.
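As a rough sketch of what such a row-based audit table might look like for my tree (illustrative names, assuming an adjacency-list tree where each node stores its parent; types would need adjusting to the actual engine):

-- Operational table: adjacency-list tree
CREATE TABLE node (
    node_id    INT PRIMARY KEY,
    parent_id  INT NULL REFERENCES node(node_id),  -- NULL for the root
    node_value VARCHAR(255)
);

-- Row-based audit table: every column of node, plus lifespan and operation info
CREATE TABLE node_audit (
    node_id    INT          NOT NULL,
    parent_id  INT          NULL,
    node_value VARCHAR(255),
    start_time TIMESTAMP    NOT NULL,   -- when this version became current
    end_time   TIMESTAMP    NULL,       -- NULL while this version is still current
    operation  CHAR(1)      NOT NULL,   -- e.g. 'I'nsert, 'U'pdate, 'D'elete, 'G'roup
    username   VARCHAR(128) NOT NULL,
    PRIMARY KEY (node_id, start_time)
);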
Do you have suggestions, from a DB design point of view, or a tutorial or article to help me start the implementation?
Thank you.

Related

Snapshot Tables with Foreign Keys vs. Snapshot Tables with Real Values

In one of our clients' databases there are a few snapshot tables that summarize useful information from many other tables (e.g. what was the state of each customer in each period, etc.).
The snapshot tables, however, contain mostly foreign keys to their original tables. Therefore, in order to obtain useful information about a snapshot, we have to join it multiple times to the corresponding tables, and these joins often take very long. Adding indexes to all FK columns (or at least to the columns used in the WHERE clauses of our queries), on the other hand, slows down the database significantly.
So my question is: wouldn't it be better to have snapshot tables with real values instead of foreign keys? And if the answer is negative, wouldn't it defeat the purpose of snapshot tables if the original tables are updated? (E.g. if an item was called 'Candle' and is now called 'Lamp', of course our snapshot remains consistent, but is it really a snapshot in that case?)
I'd lean towards storing the actual data rather than FK values, for the reason you mentioned. That said, a better solution might be to relocate this historical data along with relevant attributes (i.e. dimensions) and restructure it for analysis. Data warehousing is certainly a solution for this, although these can be very large-scale projects, so you'd need to understand the value and scope it appropriately. However, even a lightweight star schema that targets the specific events they're trying to capture could be a better solution than a large historical table with relationships to transaction-based tables (especially if the query logic against the related tables is complex).
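As a rough sketch of the "real values" direction (names are purely illustrative), the snapshot row would carry the descriptive attributes as they were at snapshot time rather than FKs back to the live tables:

CREATE TABLE customer_state_snapshot (
    snapshot_date  DATE         NOT NULL,
    customer_id    INT          NOT NULL,   -- kept for traceability, not for joins
    customer_name  VARCHAR(200) NOT NULL,   -- value copied (frozen) at snapshot time
    customer_state VARCHAR(50)  NOT NULL,
    region_name    VARCHAR(100) NOT NULL,
    PRIMARY KEY (snapshot_date, customer_id)
);

Queries against this table then need no joins back to the operational tables, and later renames in those tables no longer rewrite history.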

What is the best way to implement soft deletion in a large relational Database? [duplicate]

Working on a project at the moment, we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted field (defaulting to '0') to each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECTs, just get rows WHERE deleted_at IS NULL.
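For example (table and column names are illustrative):

ALTER TABLE orders ADD deleted_at DATETIME NULL;
-- live rows:
SELECT * FROM orders WHERE deleted_at IS NULL;
-- soft delete:
UPDATE orders SET deleted_at = CURRENT_TIMESTAMP WHERE order_id = 42;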
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
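For instance (illustrative names), something along the lines of:

CREATE VIEW active_customers AS
SELECT * FROM customers WHERE is_deleted = '0';

and have the application query active_customers instead of customers.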
Having an is_deleted column is a reasonably good approach.
If you are on Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
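A sketch of such a list partition in Oracle (table name and column types are illustrative):

CREATE TABLE table_name (
    id         NUMBER PRIMARY KEY,
    payload    VARCHAR2(4000),
    is_deleted NUMBER(1) NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);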
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc. because they only carry one piece of meaning. Nullable deleted_at and archived_at fields provide an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague, since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
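A sketch of the second technique (illustrative names; note that how unique constraints treat NULLs differs between engines, so check the behaviour on your platform):

CREATE TABLE app_user (
    user_id    INT PRIMARY KEY,
    login      VARCHAR(100) NOT NULL,
    deleted_at TIMESTAMP NULL,
    CONSTRAINT uq_app_user_login UNIQUE (login, deleted_at)  -- deleted rows keep distinct timestamps
);
-- Soft delete frees the login for re-use:
UPDATE app_user SET deleted_at = CURRENT_TIMESTAMP WHERE login = 'JOE' AND deleted_at IS NULL;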
My own opinion is +1 for moving to a new table. It takes a lot of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly and hinder the usage of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
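For example, a filtered index in SQL Server (or, with the same syntax, a partial index in PostgreSQL) might look like this, with illustrative table and column names and assuming a numeric is_deleted column:

-- Only rows with is_deleted = 0 are indexed
CREATE INDEX ix_orders_active
ON orders (customer_id, created_at)
WHERE is_deleted = 0;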
Something that I use on projects is a statusInd TINYINT NOT NULL DEFAULT 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
It scales well and is data-centric, keeping the data footprint pretty small - key for 350 GB+ DBs with real-time concerns. Using alternatives (tables, triggers) has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, i.e. published, private, deleted, needsApproval...
Create another schema and grant it ALL on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it that allows selection of only the non-deleted rows.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
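A rough sketch of such a VPD setup (schema, table, and policy names are illustrative; see the Oracle documentation linked above for the details):

-- Policy function: returns the predicate Oracle appends to every query
CREATE OR REPLACE FUNCTION not_deleted_policy (
    p_schema IN VARCHAR2,
    p_table  IN VARCHAR2
) RETURN VARCHAR2
AS
BEGIN
    RETURN 'is_deleted = 0';
END;
/

-- Attach the policy to the table
BEGIN
    DBMS_RLS.ADD_POLICY(
        object_schema   => 'APP_DATA',
        object_name     => 'ORDERS',
        policy_name     => 'HIDE_DELETED',
        function_schema => 'APP_SEC',
        policy_function => 'NOT_DELETED_POLICY',
        statement_types => 'SELECT');
END;
/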
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete

Quickly update a large amount of rows in a table, without blocking inserts on referencing tables, in SQL Server 2008

Context:
I have a system that acts as a Web UI for a legacy accounting system. This legacy system sends me a large text file several times a day, so I can update a CONTRACT table in my database (the file can contain new contracts, or just updated values for existing contracts). This table currently has around 2M rows and about 150 columns. I can't have downtime during these updates, since they happen during the day and there are usually about 40 logged-in users at any given time.
My system's users can't update the CONTRACT table, but they can insert records in tables that reference the CONTRACT table (Foreign Keys to the CONTRACT table's ID column).
To update my CONTRACT table I first load the text file into a staging table, using a bulk insert, and then I use a MERGE statement to create or update the rows, in batches of 100k records. And here's my problem - during the MERGE statement, because I'm using READ COMMITTED SNAPSHOT isolation, the users can keep viewing the data, but they can't insert anything - the transactions will time out because the CONTRACT table is locked.
Question: does anyone know of a way to quickly update this large amount of rows, while enforcing data integrity and without blocking inserts on referencing tables?
I've thought about a few workarounds, but I'm hoping there's a better way:
Drop the foreign keys. - I'd like to enforce my data consistency, so this doesn't sound like a good solution.
Decrease the batch size on the MERGE statement so that the transaction is fast enough not to cause timeouts on other transactions. - I have tried this, but the sync process becomes too slow; as I mentioned above, I receive the update files frequently and it's vital that the updated data is available shortly after.
Create an intermediate table, with a single CONTRACTID column and have other tables reference that table, instead of the CONTRACT table. This would allow me to update it much faster while keeping a decent integrity. - I guess it would work, but it sounds convoluted.
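For reference, the batched MERGE I'm running looks roughly like this (column names are simplified/illustrative; in practice the batches loop, removing merged rows from the staging table between iterations):

-- One batch: merge up to 100k staged contracts into CONTRACT
WITH batch AS (
    SELECT TOP (100000) *
    FROM contract_staging
    ORDER BY contract_id
)
MERGE contract AS tgt
USING batch AS src
    ON tgt.contract_id = src.contract_id
WHEN MATCHED THEN
    UPDATE SET tgt.amount = src.amount,
               tgt.status = src.status
WHEN NOT MATCHED THEN
    INSERT (contract_id, amount, status)
    VALUES (src.contract_id, src.amount, src.status);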
Update:
I ended up dropping my foreign keys. Since the system has been in production for some time and the logs don't ever show foreign key constraint violations, I'm pretty sure no inconsistent data will be created. Thanks to everyone who commented.

Database design

Our database is part of a (specialized) desktop application.
The primary goal is to keep data about certain events.
Events happen every few minutes.
The data collected about events changes frequently with new data groups being added in and old ones swapped out almost monthly (the data comes in definite groups).
I have to put together a database to track the events. A first stab at that might be to simply have a single big table where each row is an event, and that is basically what our data looks like, but this seems undesirable because of our constantly changing groups of data (i.e. the number of columns would either keep growing perpetually or we would constantly have this month's database incompatible with last month's database - ugh!). Because of this I am leaning toward the following, even though it creates circular references. (But maybe this is a stupid idea.)
Create tables like
Table Events
Table Group of the Month 1
Table Group of the Month 2
...
Table Events has:
A primary key whose deletion cascade to delete rows with foreign keys referencing it
A nullable foreign key for each data group table
Each data group table has:
A primary key, whose deletion cascades to null out foreign keys referencing it
Columns for the data in that group
A non-nullable foreign key back to the event
This still leaves you with a growing, changing Event Table (as you need to add new foreign key columns for each new data group), just much less drastically. However it seems more modular to me than one giant table. Is this a good solution to this situation? If not, what is?
Any suggestions?
P.S. We are using SQL Express or SQL Compact (we are currently experimenting with which one suits us best)
Why not use basically the single table approach and store the changing event data as XML in an XML column? You can even use XSD schemas to account for the changing data types, and you can add indexes on XML data if fast query performance on some XML data is required.
A permanently changing DB schema wouldn't really be an option if I were to implement such a database.
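As a sketch of that single-table-plus-XML approach in SQL Server (names are illustrative; note that SQL Server Compact does not support the XML type, so this assumes the Express edition):

CREATE TABLE events (
    event_id    INT IDENTITY(1,1) PRIMARY KEY,
    occurred_on DATETIME2 NOT NULL,
    event_data  XML NOT NULL          -- the month's changing data groups live here
);

-- Optional: speed up queries into the XML payload
CREATE PRIMARY XML INDEX pxi_events_data ON events (event_data);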
You should not have foreign keys for each data group table in the event table.
The event table already has an event_id which is in each data group table. So you can get from event to the child tables. Furthermore there will be old rows in the event table that are not in the latest data group table. So you really can't have a foreign key.
That said, I would wonder whether there is additional structure in the data group tables that can be used to clean up your design. Without knowing anything about what they look like I can't say. But if there is, consider taking advantage of it! (A schema that changes every month is a pretty bad code smell.)
Store your data at as granular a level as possible. It might be as simple as:
EventSource int FK
EventType int FK
Duration int
OccuredOn datetime
Get the data right and as simple as possible in the first place, and then
Aggregate via views or queries. Your instincts are correct about the ever-changing nature of the columns - better to control that in T-SQL than in DDL.
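A sketch of that shape, with an aggregating view on top (T-SQL; the table name and the view are illustrative, the columns follow the list above):

CREATE TABLE Event (
    EventId     INT IDENTITY(1,1) PRIMARY KEY,
    EventSource INT NOT NULL,        -- FK to a lookup table (not shown)
    EventType   INT NOT NULL,        -- FK to a lookup table (not shown)
    Duration    INT NOT NULL,
    OccuredOn   DATETIME NOT NULL
);
GO
-- Shape the data per reporting need in T-SQL instead of changing the schema
CREATE VIEW EventDailySummary AS
SELECT CAST(OccuredOn AS DATE) AS EventDate,
       EventType,
       COUNT(*)      AS EventCount,
       AVG(Duration) AS AvgDuration
FROM Event
GROUP BY CAST(OccuredOn AS DATE), EventType;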
I faced this problem a number of years ago with logfiles for massive armies of media players, and where I ultimately ended up was taking this data and creating an OLAP cube out of it. OLAP is another approach to database design where the important thing is optimizing it for reporting and "sliceability". It sounds like you're on that track, where it would be very useful to be able to look at a quick month's view of data, then a quarter's, and then back down to a week's. This is what OLAP is for.
Microsoft's technology for this is Analysis Services, which comes as part of SQL Server. If you didn't want to take the entire plunge (OLAP has a pretty steep learning curve), you could also look at doing a selectively denormalized database that you populate each night with ETL from your source database.
HTH.

What's difference between a temporal database and a historical archive database?

It is said here:
http://www.ibm.com/developerworks/web/library/wa-dbdsgn2.html
Each table in the DB should have a history table, mirroring the entire history of the primary table. If entries in the primary table are to be updated, the old contents of the record are first copied to the history table before the update is made. In the same way, deleted records in the primary table are copied to the history table before being deleted from the primary one. The history tables always have the name of the corresponding primary one, but with _Hist appended.
In a temporal DB (see here: temporal database modeling and normalisation) there isn't a separate table, as far as I understand.
So when should I create a separate table, and when not?
What Robert said theoretically - nothing to add.
Practically, the choice between a temporal table and a main+hist table has other implications.
For heavily maintained data (e.g. updates/deletes greatly outnumber the inserts), having a historical (sometimes also referred to as "audit" - as it is the main mechanism to enforce audit trail of DB data) table allows keeping the main table reasonably small sized compared to keeping the audit info inside the main table itself. This can have significant performance implications for both selects and inserts on the main table, especially in light of index optimization discussed below.
To top that off, the indices on the hist/audit table do not need to be 100% identical to those on the main table. You can omit indices that aren't needed for querying the audit data (thus speeding up inserts into the audit table) and, vice versa, optimize the indices you do keep for your specific audit queries (including ordering the table by timestamp via a clustered index), without saddling the main table with indices that slow down data changes (and, in the case of clustering on time of update, clash with the main table's clustered index, which is why you usually can't keep the main table clustered in temporal order).
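As a sketch of that main+hist arrangement (SQL Server syntax, illustrative names; the trigger copies the old row before it changes, as in the article quoted in the question, and the hist table is clustered by time while the main table stays clustered on its key):

CREATE TABLE customer (
    customer_id INT PRIMARY KEY,          -- clustered on the key for OLTP access
    name        NVARCHAR(200) NOT NULL,
    status      VARCHAR(20)   NOT NULL
);

CREATE TABLE customer_hist (
    customer_id INT           NOT NULL,
    name        NVARCHAR(200) NOT NULL,
    status      VARCHAR(20)   NOT NULL,
    changed_at  DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    changed_by  SYSNAME       NOT NULL DEFAULT SUSER_SNAME()
);
-- The hist table is ordered by time, which the main table cannot afford to be
CREATE CLUSTERED INDEX cx_customer_hist_changed_at ON customer_hist (changed_at);
GO
CREATE TRIGGER trg_customer_hist
ON customer
AFTER UPDATE, DELETE
AS
INSERT INTO customer_hist (customer_id, name, status)
SELECT customer_id, name, status
FROM deleted;   -- 'deleted' holds the pre-change rows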
History tables provide a history of (generally non-temporal) changes made to the primary database records by users. This history is archival in nature (i.e. accessed occasionally for historical purposes). The temporal information (when the change was made) is secondary in nature.
A temporal database is designed specifically to execute time queries against. The temporal information is primary in nature, and kept online for immediate retrieval. A second table is not created, unless archiving also needs to take place.
http://en.wikipedia.org/wiki/Temporal_database
The history table that is talked of in that developerworks article is a table that holds the history of the database (i.e. the history of our beliefs about reality).
The kind of history that you asked about in that other thread holds our (current !) belief about the history of reality.
Note the difference. The two concur only to the extent that our past beliefs about reality have indeed been correct. And that is not always 100%.
If you use the former as being the latter, then you are in a sense assuming that that degree of concurrence is indeed 100%, i.e. that all your past beliefs about reality always and by definition coincided with reality, i.e. you are assuming that it is impossible for you to have had any faulty belief about reality.
Tables that hold the history of other tables can suit purposes of auditing. Tables that hold the history of reality can suit the purpose of any user that is interested in that historical information.
