I have represented a tree in a relational table. Each node is a record whose values can change, and I can also insert, delete, or group nodes.
What is the best approach to keeping a history table of all these changes?
Is there a database design pattern available?
This doesn't really seem like a design pattern so much as a DB design question.
Important question:
1. How secure does this historical record need to be? Is it safe to store in the DB itself, or are you concerned that the DB might be compromised and you need a tamper-proof record?
Assuming you can store this data in the DB, then there isn't much to design:
Create a trigger on whatever meta-table your DB uses to record new tables and have IT create a new trigger on the table just created. This way, it's maintenance free.
Gather the following information in a flat table: who, what, when
When you create your historical table, make sure you create triggers preventing updates to any record or any deletes. The table should be append-only!
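As a concrete illustration, here is a minimal sketch of such an append-only audit table (SQL Server syntax; the table and column names are illustrative assumptions, not taken from your schema):

CREATE TABLE NodeAudit (
    audit_id    bigint IDENTITY PRIMARY KEY,
    node_id     int       NOT NULL,                          -- what
    operation   char(1)   NOT NULL,                          -- 'I', 'U' or 'D'
    changed_by  sysname   NOT NULL DEFAULT SUSER_SNAME(),    -- who
    changed_at  datetime2 NOT NULL DEFAULT SYSUTCDATETIME()  -- when
);

-- Reject any attempt to change or remove history rows:
-- the INSTEAD OF trigger simply refuses to perform the UPDATE/DELETE.
CREATE TRIGGER NodeAudit_AppendOnly
ON NodeAudit
INSTEAD OF UPDATE, DELETE
AS
BEGIN
    RAISERROR('NodeAudit is append-only.', 16, 1);
END;

Insert triggers on the operational tables would then write one row per change into this table.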
Thank you for the answer.
About your security question: I don't have any restrictions, so I can store historical data in the DB.
Assuming that historical data can be modeled in a relational database through several techniques, such as separate tables for history records and transaction logs, I have found:
Row-based Auditing - This technique creates a separate table for each relational table to maintain historical data. The auditing table contains every column of the operational table plus two timestamps, start time and end time, to track the lifespan of the data, and two additional attributes: operation type and username.
Column-Based Auditing - Column-based auditing solves the redundancy of row-based auditing. The historical columns of the auditing table store only the changed values, plus the primary key (such as the ID), which is used to reference the operational table.
Log-Table Auditing - Log tables have been used for transaction management in relational databases for a long time. Because a transaction needs to record the operation, the data, and the time of execution, log tables can be utilized for auditing purposes too.
For my problem I am guessing that row-based auditing may be a good approach, because it also makes it possible to track the operation, and if I add some attributes such as root/leaf I can store the structure of the tree.
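Concretely, something like this is what I have in mind for the row-based approach (just a rough sketch; the column names are my own assumptions):

CREATE TABLE node_history (
    history_id  BIGINT       PRIMARY KEY,
    node_id     INT          NOT NULL,   -- key of the operational row
    parent_id   INT,                     -- keeps the tree structure (NULL for the root)
    node_value  VARCHAR(255),            -- copy of the node's data at that moment
    operation   CHAR(1)      NOT NULL,   -- 'I' insert, 'U' update, 'D' delete, 'G' group
    username    VARCHAR(128) NOT NULL,
    start_time  TIMESTAMP    NOT NULL,
    end_time    TIMESTAMP                -- NULL while this version is the current one
);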
Do you have any suggestion, from a DB design point of view, or some tutorial or article, in order to start the implementation?
Thank you.
I am looking at a database which has almost no foreign keys defined.
Is there a tool that can perform some data analysis/heuristics and "guess" the relations based on the data? I am looking for some kind of report which can be used as a manual guide/checklist.
I had a similar problem - every table had an Object_ID column... but also had secondary IDs.
All were of a weird GUID-ish form.
I ended up writing a brute-force scanner (using dynamic SQL driven by information_schema.columns).
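Something along these lines, in case it helps - a rough SQL Server sketch of the idea. It assumes the IDs are uniqueidentifier columns; dbo.SomeTable and Object_ID are placeholders for your own names. It generates one probe query per candidate column, and running the output shows which columns contain values from the source column:

SELECT 'SELECT ''' + c.TABLE_SCHEMA + '.' + c.TABLE_NAME + '.' + c.COLUMN_NAME + ''' AS candidate, COUNT(*) AS matches '
     + 'FROM ' + QUOTENAME(c.TABLE_SCHEMA) + '.' + QUOTENAME(c.TABLE_NAME) + ' '
     + 'WHERE ' + QUOTENAME(c.COLUMN_NAME)
     + ' IN (SELECT Object_ID FROM dbo.SomeTable);'
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.DATA_TYPE = 'uniqueidentifier'
  AND NOT (c.TABLE_NAME = 'SomeTable' AND c.COLUMN_NAME = 'Object_ID');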
Of course this approach relied on the values being globally unique... If you have a bunch of int identity cols and no way to connect the Tables then you are in a bit of trouble!
Perhaps there is a timestamp column or a DateTime defaulting to GetDate() - you could use this to identify records in different tables that were created at approximately the same time.
A lot depends on your schema...
Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance now, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT, just get rows WHERE deleted_at IS NULL.
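A minimal sketch of that (SQL Server-flavoured; the table and column names are assumptions):

ALTER TABLE orders ADD deleted_at datetime NULL;

-- "delete" a record
UPDATE orders SET deleted_at = GETDATE() WHERE id = 42;

-- normal reads only see live rows
SELECT * FROM orders WHERE deleted_at IS NULL;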
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
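For instance (a sketch; the table and column names are assumed):

CREATE VIEW active_records AS
    SELECT * FROM your_table WHERE is_deleted = '0';

-- application code then reads from the view instead of the base table
SELECT * FROM active_records WHERE customer_id = 7;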
Having an is_deleted column is a reasonably good approach.
If it is Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
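The DDL for that could look roughly like this (illustrative names):

CREATE TABLE my_table (
    id         NUMBER PRIMARY KEY,
    data       VARCHAR2(4000),
    is_deleted NUMBER(1) NOT NULL
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_active  VALUES (0),
    PARTITION p_deleted VALUES (1)
);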
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archived, etc., because they only carry one piece of meaning. A nullable deleted_at or archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague, since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like the time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
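A rough sketch of that move (PostgreSQL-flavoured; the names are illustrative):

-- archive table mirrors the live table plus deletion metadata
CREATE TABLE products_deleted (
    LIKE products,
    deleted_at timestamp NOT NULL DEFAULT now(),
    deleted_by text      NOT NULL
);

-- "delete" = copy then remove, inside one transaction
BEGIN;
INSERT INTO products_deleted
    SELECT p.*, now(), current_user FROM products p WHERE p.id = 42;
DELETE FROM products WHERE id = 42;
COMMIT;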
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
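A sketch of technique 2 (names assumed). One caveat: whether two rows with the same login and a NULL deleted_at collide depends on the DBMS - SQL Server and Oracle treat matching NULLs in a composite key as duplicates, while PostgreSQL considers NULLs distinct unless you use a partial unique index (or NULLS NOT DISTINCT in recent versions):

ALTER TABLE users
    ADD CONSTRAINT uq_users_login_deleted UNIQUE (login, deleted_at);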
My own opinion is +1 for moving to a new table. It takes lots of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding WHERE deleted=0 to all your queries will slow them down significantly, and hinder the usage of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention what product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index WHERE is_deleted=0, mitigating some of the negatives of this particular approach.
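For example (a sketch; the indexed columns are illustrative):

-- SQL Server 2008+: filtered index over live rows only
CREATE NONCLUSTERED INDEX ix_orders_live
    ON orders (customer_id, created_at)
    WHERE is_deleted = 0;

-- PostgreSQL: the equivalent partial index
CREATE INDEX ix_orders_live_pg
    ON orders (customer_id, created_at)
    WHERE is_deleted = 0;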
Something that I use on projects is a statusInd tinyint not null default 0 column.
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
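One possible reading of that bitmask scheme (the flag values and names below are my assumption - the answer doesn't spell them out):

-- statusInd bit values (illustrative): 1 = deleted, 2 = archived, 4 = replicated, 8 = restore-pending
UPDATE products SET statusInd = statusInd | 1 WHERE id = 42;   -- soft delete

-- view exposing only "live" rows to consuming applications
CREATE VIEW products_live AS
    SELECT * FROM products WHERE statusInd & 1 = 0;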
It scales well and is data-centric, keeping the data footprint pretty small - key for 350GB+ DBs with realtime concerns. Using alternatives such as extra tables and triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, i.e. published, private, deleted, needsApproval...
Create another schema and grant it ALL on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it allowing selection of only the non-deleted rows.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
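A minimal sketch of what the VPD policy could look like (assuming a data schema APP_DATA with a table NODES that carries an is_deleted flag, and a separate SEC schema owning the policy function - all names are illustrative):

-- Policy function: returns the predicate Oracle appends to every query
CREATE OR REPLACE FUNCTION sec.not_deleted_predicate (
    schema_name IN VARCHAR2,
    table_name  IN VARCHAR2
) RETURN VARCHAR2 IS
BEGIN
    RETURN 'is_deleted = 0';
END;
/

BEGIN
    DBMS_RLS.ADD_POLICY(
        object_schema   => 'APP_DATA',
        object_name     => 'NODES',
        policy_name     => 'hide_deleted_rows',
        function_schema => 'SEC',
        policy_function => 'NOT_DELETED_PREDICATE',
        statement_types => 'SELECT');
END;
/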
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete
The website I'm creating has "Events". Events have a title, a date, and the user IDs of the people involved. An event can be anything from following someone to creating a new post. I was thinking of storing all events in a single table, but I could see this getting very big very quickly.
Am I doing it right? When I need to search the table for, say, events pertaining to a certain user, how bad a toll would that be on the system? Could I optimise it somehow?
You would add indexes on the columns you most frequently use in WHERE clauses, e.g. if you are frequently selecting all events that pertain to a certain user, you should create an index on the user_id column.
http://www.postgresql.org/docs/9.1/static/sql-createindex.html
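For example (assuming the table is called events and the column user_id):

CREATE INDEX idx_events_user_id ON events (user_id);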
As long as the data in that table is normalized, you should be OK. If you find that read queries on that table slow down, you can add an index to some of the columns, but you should keep in mind that this will slow down writes to that table.
If you find that the performance is too slow, you can switch to using some NoSQL database that's better optimized for large tables.
If the table will be really big, you can use partitioning:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
but you must choose a good partition key - good candidates are:
event timestamp
event type
user_id
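As a sketch, here is the inheritance-based approach described in those 9.1 docs, partitioning by event timestamp (all names are illustrative):

CREATE TABLE events (
    id         bigserial,
    user_id    integer     NOT NULL,
    event_type text        NOT NULL,
    title      text,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- one child table per time range; the CHECK constraints let the planner
-- skip partitions that cannot match (constraint_exclusion)
CREATE TABLE events_2012_q1 (
    CHECK (created_at >= '2012-01-01' AND created_at < '2012-04-01')
) INHERITS (events);

CREATE TABLE events_2012_q2 (
    CHECK (created_at >= '2012-04-01' AND created_at < '2012-07-01')
) INHERITS (events);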
I have an Excel spreadsheet I am going to be turning into a DB to mine data and build an interactive app. There are about 20 columns and 80,000 records. Practically all records have about half of their column data as null, but which columns have data is random for each record.
The options would be to:
Create a more normalized DB with a table for each column and use 20 joins to view all the data. I would think the benefit would be a DB with practically no NULL values, so the size would be smaller. One of the major cons would be more code to update each table from the application side.
Create a flat file with one table that has all the columns. I figure this will be easier for the application side to do updates, but it will result in a table with a lot of empty data space.
I don't get why you think updating a normalized db is harder than a flat table. It's very much the other way around.
Think about inserting a relation between a customer and a product (basically an order) into the flat table. You'd have to:
select the row that describes the rest of the data, but has nulls or something in the product columns
you have to update the product columns
you have to insert a HUGE row into the db
What about the first time? What do you do with the initial nulls? Do you modify your selects to ignore them? What if you want the nulls?
What if you delete the last product? Do you change it into an update and set nulls for just a few columns?
Joins aside, working with a normalized table is trivial by design. You pay for its triviality with performance; that's the actual trade-off.
If you are going to be using a relational database, you should normalize your tables, if nothing else in order to ease data maintenance and ensure you don't have duplicate data.
You can investigate the use of a document database for storage instead of a relational database, though it is not the only option.
Generally, normalized databases will end up being easier to write code against, as SQL code is designed with normalized tables in mind.
Normalizing doesn't have to be done on all columns, so there's a middle ground between the two options you present. A good rule of thumb is that if you have columns that have values being repeated heavily across records, those can be good candidates for normalizing into one or more separate tables. Putting each column in its own table and joining across them is almost certainly overdoing it.
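As a sketch of that middle ground, suppose one of the 20 columns is a manufacturer name that repeats heavily across the 80,000 records (a hypothetical example - your column names will differ):

-- lookup table for the heavily repeated value
CREATE TABLE manufacturers (
    manufacturer_id integer PRIMARY KEY,
    name            text NOT NULL UNIQUE
);

-- the main table keeps its other (nullable) columns and just references the lookup
CREATE TABLE products (
    product_id      integer PRIMARY KEY,
    manufacturer_id integer REFERENCES manufacturers,
    weight_kg       numeric,   -- NULL when unknown
    color           text       -- NULL when unknown
);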
Don't normalize too much. It's hard to maintain a canonical model as your application grows. Storage is cheap. Don't get fooled into coding headaches because of concerns that were valid 20 years ago. No need to go NoSQL unless you need it.