I was working with one product where almost every table had those columns. As developers, we constantly had to join to the Users table to get the Id of whoever created a record, and it made a mess of the code.
I'm designing a new product and thinking about this again. Does it have to be like this? Obviously, it is good to know who created a record and when. But having 300+ tables reference the same User table doesn't seem like a very good idea.
How do you handle things like this? Should I create a CreatedBy column only on the major entities where it's most likely needed in the UI, and then deal with the joins? Or should I put it everywhere? Or maybe have a separate "Audit" table where I store all of this and look it up only on demand (not every time an entity is displayed in the UI)?
I'm just worried about the performance aspect, where every UI query will hit the User table.
EDIT: This is going to be a SQL Server 2008 R2 database.
The problem with that approach is that you only know who created the row and who changed the row last. What if the last person to update the row was correcting the previous updater's mistake?
If you're interested in doing full auditing for compliance or accountability reasons, you should probably look into SQL Server Audit. You can dictate which tables you're auditing, can change those on the fly without having to mess with your schema, and you can write queries against this data specifically instead of mixing the auditing logic with your normal application query logic (never mind widening every row of the table itself). This will also allow you to audit SELECT queries, which other potential solutions (triggers, CDC, Change Tracking, all of which are either more work or not complete for true auditing purposes) won't let you do.
I know that this is an older post, but one way to avoid the lookup on the user table is to de-normalize the audit fields.
So instead of a userid in the CreatedBy field, you insert the username itself. This allows the table to be reviewed without the user lookup, and it keeps later changes to your user table, such as deleted users, from affecting the audit fields.
I usually add the following to the end of a table
IsDeleted bit default 0
CreatedBy varchar(20)
CreatedOn datetime2 default getdate()
UpdatedBy varchar(20)
UpdatedOn datetime2 default getdate()
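As a rough illustration of the de-normalized columns above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the Orders table, its Total column and the two helper functions are made up for the example.

```python
import sqlite3

# SQLite stands in for SQL Server; Orders/Total are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        Id        INTEGER PRIMARY KEY,
        Total     REAL NOT NULL,
        IsDeleted INTEGER NOT NULL DEFAULT 0,
        CreatedBy TEXT NOT NULL,               -- the username itself, not a user id
        CreatedOn TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        UpdatedBy TEXT NOT NULL,
        UpdatedOn TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")

def create_order(total, username):
    """Insert a row; the creator is also the first updater."""
    cur = conn.execute(
        "INSERT INTO Orders (Total, CreatedBy, UpdatedBy) VALUES (?, ?, ?)",
        (total, username, username),
    )
    return cur.lastrowid

def update_order(order_id, total, username):
    """Change the row and stamp who changed it and when."""
    conn.execute(
        "UPDATE Orders SET Total = ?, UpdatedBy = ?, "
        "UpdatedOn = CURRENT_TIMESTAMP WHERE Id = ?",
        (total, username, order_id),
    )

order_id = create_order(10.0, "alice")
update_order(order_id, 12.5, "bob")

# No join to a Users table is needed to show who touched the row.
row = conn.execute(
    "SELECT CreatedBy, UpdatedBy, Total FROM Orders WHERE Id = ?", (order_id,)
).fetchone()
print(row)
```

The trade-off is that a renamed user keeps the old name in existing rows, which is usually what you want from an audit field anyway.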
I'm creating an Account table in my project's database. Each account has A LOT of properties:
login
email
password
birthday
country
avatarUrl
city
etc.
Most of them are nullable. My question is, how should I design this in database?
Should it be one table with all those properties? Or maybe should I create two tables, like AccountSet, and AccountInfoSet, where I would store all those 'advanced' user's settings? And last, but not least: if this should be two tables, what kind of relation should be between those tables?
If this is a relational database, then I definitely would not store those properties as fields in the Account table. Some reasons why:
Once your application goes to production (or maybe it's already there), the schema maintenance will become a nightmare. You will absolutely add more properties and having to constantly touch that table in production will be painful.
You will most likely end up with orphaned fields. I've seen this many times where you'll introduce a property and then stop using it, but it's baked into your schema and you might be too scared to remove it.
Ideally you want to avoid having such sparse data in a table (lots of fields with lots of nulls).
My suggestion would be to do what you're already thinking about and that's to introduce a property table for Accounts. You called it AccountInfoSet.
The table should look like this:
AccountId int,
Property nvarchar(50),
Value nvarchar(50)
(Of course you'll set the data types and sizes as you see fit.)
Then you'll join to the AccountInfoSet table and maybe pivot on the "advanced" properties - turn the rows into columns with a query.
In .NET you can also write a stored procedure that returns two queries with one call and look at the tables in the DataSet object.
Or you could just make two separate calls. One for Account and one for the properties.
Lots of ways to get the information out, but make sure you don't just add fields to Account if you're using a relational database.
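For what it's worth, the "pivot" step can be sketched with plain conditional aggregation. This uses Python's sqlite3 module as a stand-in database; the Login column, the sample rows and the composite primary key are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Account (AccountId INTEGER PRIMARY KEY, Login TEXT NOT NULL);
    CREATE TABLE AccountInfoSet (
        AccountId INTEGER NOT NULL REFERENCES Account(AccountId),
        Property  TEXT NOT NULL,
        Value     TEXT,
        PRIMARY KEY (AccountId, Property)   -- assumption: one value per property
    );
    INSERT INTO Account VALUES (1, 'jdoe'), (2, 'asmith');
    INSERT INTO AccountInfoSet VALUES
        (1, 'country', 'PL'), (1, 'city', 'Warsaw'), (2, 'country', 'US');
""")

# Turn property rows into columns; properties that are missing come back as NULL.
PIVOT_SQL = """
    SELECT a.Login,
           MAX(CASE WHEN i.Property = 'country' THEN i.Value END) AS country,
           MAX(CASE WHEN i.Property = 'city'    THEN i.Value END) AS city
    FROM Account a
    LEFT JOIN AccountInfoSet i ON i.AccountId = a.AccountId
    GROUP BY a.AccountId, a.Login
    ORDER BY a.AccountId
"""
rows = conn.execute(PIVOT_SQL).fetchall()
for row in rows:
    print(row)
```

Each new property is then just another row in AccountInfoSet plus, if you want it as a column, one more CASE expression in the query.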
I am looking for pattern, framework or best practice to handle a generic problem of application level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable datasource for a product catalog. Data can occasionally be unavailable, incomplete or inconsistent (the issue might come from a manual data entry error, an ETL failure...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement a safety mechanism when updating the MySQL table to "synchronize" it with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the datasource => use a "deleted" boolean/date column or an archive/history table.
check for inconsistent changes => configure rules per column, such as: should never change, should only increment.
check for integrity issue => (standard problem, no point discussing approach)
ability to roll back the last sync => restore from a history table? use a version inc/date column?
What I am looking for is a best practice and a pattern/tool to handle such a problem. If you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original, but has an additional column "superseded_id" which gets the import-id of the import that changed the row (deletion is a change). The primary key is (row_id, superseded_id).
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
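A minimal sketch of the history-table scheme and the rollback steps above, using Python's sqlite3 module in place of MySQL (SQLite also accepts REPLACE INTO). The name/price columns and the helper function names are invented for the example. Note that this sketch does not undo rows that the bad import newly inserted, since those never reach the history table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (row_id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE history_table (
        row_id INTEGER, name TEXT, price REAL,
        superseded_id INTEGER,              -- id of the import that changed the row
        PRIMARY KEY (row_id, superseded_id)
    );
    INSERT INTO main_table VALUES (1, 'widget', 9.99), (2, 'gadget', 19.99);
""")

def _archive(row_id, import_id):
    # Copy the current version of the row into history before touching it.
    conn.execute(
        "INSERT INTO history_table "
        "SELECT row_id, name, price, ? FROM main_table WHERE row_id = ?",
        (import_id, row_id),
    )

def update_price(row_id, price, import_id):
    _archive(row_id, import_id)
    conn.execute("UPDATE main_table SET price = ? WHERE row_id = ?", (price, row_id))

def delete_row(row_id, import_id):
    _archive(row_id, import_id)
    conn.execute("DELETE FROM main_table WHERE row_id = ?", (row_id,))

def rollback_import(bad_import_id):
    # Restore the versions that the bad import superseded, then drop that history.
    conn.execute(
        "REPLACE INTO main_table "
        "SELECT row_id, name, price FROM history_table WHERE superseded_id = ?",
        (bad_import_id,),
    )
    conn.execute("DELETE FROM history_table WHERE superseded_id >= ?", (bad_import_id,))

# Import 2 turns out to be bad: it zeroes a price and drops a product.
update_price(1, 0.0, import_id=2)
delete_row(2, import_id=2)
rollback_import(2)
restored = conn.execute("SELECT * FROM main_table ORDER BY row_id").fetchall()
print(restored)
```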
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database as a new table main_table_$id, with $id being the highest import id, and have main_table be a trivial view defined as SELECT * FROM main_table_$someid. By redefining the view to SELECT * FROM main_table_$newid we can atomically switch the table.
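The view switch can be sketched roughly like this. SQLite (via Python's sqlite3) is used as a stand-in and has no CREATE OR REPLACE VIEW, so the sketch drops and recreates the view inside one transaction; in MySQL a single CREATE OR REPLACE VIEW statement would play that role. The publish helper, the price column and the import ids 41 and 42 are assumptions for the example.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction around the view swap is explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE main_table_41 (row_id INTEGER PRIMARY KEY, price REAL);
    INSERT INTO main_table_41 VALUES (1, 9.99);
    CREATE VIEW main_table AS SELECT * FROM main_table_41;
""")

def publish(import_id, rows):
    """Load a verified copy into its own table, then repoint the view at it."""
    table = f"main_table_{int(import_id)}"   # int() keeps the name safe to interpolate
    conn.execute(f"CREATE TABLE {table} (row_id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.execute("BEGIN")
    conn.execute("DROP VIEW main_table")
    conn.execute(f"CREATE VIEW main_table AS SELECT * FROM {table}")
    conn.execute("COMMIT")

publish(42, [(1, 10.49), (2, 5.0)])
live = conn.execute("SELECT * FROM main_table ORDER BY row_id").fetchall()
print(live)
```

The old main_table_41 stays around, so going back is just another view redefinition.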
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying price would be "update products set price = 1 where product_id = ...."
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to update products set valid_until = getdate() where product_id = xxx and valid_until is null, and then insert a new row with the "deleted_flag = true".
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
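Here is a small sketch of the versioned model, again with Python's sqlite3 as a stand-in. One assumption is stated loudly: to hold several versions of the same product, the key in this sketch is (product_id, valid_from) rather than product_id alone. The new_version helper and the sample dates are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        product_id   INTEGER NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_until  TEXT,                 -- NULL marks the current version
        deleted_flag INTEGER NOT NULL DEFAULT 0,
        price        REAL,
        description  TEXT,
        PRIMARY KEY (product_id, valid_from)  -- assumption, see lead-in
    )
""")

def new_version(product_id, now, price=None, description=None, deleted=False):
    """Close the current version (if any) and insert its successor."""
    current = conn.execute(
        "SELECT price, description FROM products "
        "WHERE product_id = ? AND valid_until IS NULL",
        (product_id,),
    ).fetchone() or (None, None)
    conn.execute(
        "UPDATE products SET valid_until = ? "
        "WHERE product_id = ? AND valid_until IS NULL",
        (now, product_id),
    )
    # Values that are not supplied are carried over from the closed version.
    conn.execute(
        "INSERT INTO products (product_id, valid_from, deleted_flag, price, description) "
        "VALUES (?, ?, ?, ?, ?)",
        (product_id, now, int(deleted),
         current[0] if price is None else price,
         current[1] if description is None else description),
    )

new_version(1, "2012-01-01", price=10.0, description="widget")  # create
new_version(1, "2012-01-05", price=12.0)                        # change price
new_version(1, "2012-01-09", deleted=True)                      # "delete"

versions = conn.execute(
    "SELECT valid_from, valid_until, deleted_flag, price "
    "FROM products WHERE product_id = 1 ORDER BY valid_from"
).fetchall()
for v in versions:
    print(v)
```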
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than take the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transactions" table:
product_stock_transactions
product_id (FK)  transaction_date  transaction_quantity  transaction_source
1                1 Jan 2012        100                   product_feed
1                2 Jan 2012        -3                    stock_adjust_feed
1                3 Jan 2012        10                    product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
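The stock level at any point is then just a sum over the ledger. A minimal sketch with Python's sqlite3, using the three sample rows from the table above; the stock_as_of function name and the ISO date format are choices made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_stock_transactions (
        product_id           INTEGER NOT NULL,
        transaction_date     TEXT NOT NULL,    -- ISO 8601, so text comparison sorts by date
        transaction_quantity INTEGER NOT NULL,
        transaction_source   TEXT NOT NULL
    );
    INSERT INTO product_stock_transactions VALUES
        (1, '2012-01-01', 100, 'product_feed'),
        (1, '2012-01-02',  -3, 'stock_adjust_feed'),
        (1, '2012-01-03',  10, 'product_feed');
""")

def stock_as_of(product_id, date):
    """Sum every transaction up to and including the given date."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(transaction_quantity), 0) "
        "FROM product_stock_transactions "
        "WHERE product_id = ? AND transaction_date <= ?",
        (product_id, date),
    ).fetchone()
    return total

print(stock_as_of(1, "2012-01-02"))  # 97
print(stock_as_of(1, "2012-01-03"))  # 107
```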
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels, and deleting (or archiving) old records.
For example, I have a table which stores details about properties, which could have owners, a value, etc.
Is there a good design to keep the history of every change to owner and value? I want to do this for many tables, kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other meta data such as field_type, user_id, user_ip, action (update, delete, insert) etc.. can be useful.
The structure of such records will most likely need to be transformed to be used.
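A rough sketch of the field-based variant, using Python's sqlite3. The audit_changes helper, the changed_at/user_id/action columns and the sample record are assumptions for the example; values are stored as text, which is the simplest of the options described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit_field (
        table_name  TEXT NOT NULL,
        id          INTEGER NOT NULL,
        field_name  TEXT NOT NULL,
        field_value TEXT,               -- every type is stored as text in this sketch
        changed_at  TEXT NOT NULL,
        user_id     TEXT,
        action      TEXT NOT NULL
    )
""")

def audit_changes(table_name, row_id, before, after, changed_at, user_id):
    """Write one audit row per field whose value differs between the two dicts."""
    changed = [name for name in after if before.get(name) != after[name]]
    conn.executemany(
        "INSERT INTO audit_field VALUES (?, ?, ?, ?, ?, ?, 'update')",
        [
            (table_name, row_id, name,
             None if after[name] is None else str(after[name]),
             changed_at, user_id)
            for name in changed
        ],
    )
    return changed

changed = audit_changes(
    "properties", 7,
    before={"owner": "Ann", "value": 100},
    after={"owner": "Bob", "value": 100},
    changed_at="2012-01-01T10:00:00", user_id="admin",
)
logged = conn.execute("SELECT field_name, field_value FROM audit_field").fetchall()
print(changed, logged)
```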
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database create a generalized table that has all the fields as the original record, plus a versioning field (additional meta data again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or have no indexes at all and no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, but of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity into INSERT activity, and to change DELETE activity into UPDATing the IS_DELETED boolean.
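To make the read side concrete, here is a sketch of an "as of" lookup against such a table. It uses Python's sqlite3 rather than Oracle, and the as_of function and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (
        my_table_id    INTEGER NOT NULL,
        attr1          TEXT NOT NULL,
        attr2          INTEGER,
        effective_date TEXT NOT NULL,
        is_deleted     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (my_table_id, effective_date)
    );
    -- An "update" and a "delete" are both recorded as new rows.
    INSERT INTO my_table VALUES
        (1, 'a', 10, '2020-01-01', 0),
        (1, 'a', 20, '2020-02-01', 0),
        (1, 'a', 20, '2020-03-01', 1);
""")

def as_of(my_table_id, date):
    """Return (attr1, attr2) as they stood on the given date, or None."""
    row = conn.execute(
        "SELECT attr1, attr2, is_deleted FROM my_table "
        "WHERE my_table_id = ? AND effective_date <= ? "
        "ORDER BY effective_date DESC LIMIT 1",
        (my_table_id, date),
    ).fetchone()
    if row is None or row[2]:
        return None          # not created yet, or deleted by then
    return row[:2]

print(as_of(1, "2020-01-15"))
print(as_of(1, "2020-02-15"))
print(as_of(1, "2020-03-15"))
```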
Unreason:
You are correct that this solution is similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required)
RecordID - points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
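A sketch of the DAL flavour of this, using Python's sqlite3: the previous values are copied to properties_audit and the update is applied in the same transaction. The update_property helper, the changed_at column name and the sample values are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (ID INTEGER PRIMARY KEY, value1 TEXT, value2 TEXT);
    CREATE TABLE properties_audit (
        ID         INTEGER PRIMARY KEY AUTOINCREMENT,
        RecordID   INTEGER NOT NULL,
        changed_at TEXT NOT NULL,
        value1     TEXT,
        value2     TEXT
    );
    INSERT INTO properties VALUES (1, 'Ann', '100');
""")

ALLOWED_COLUMNS = {"value1", "value2"}

def update_property(record_id, changed_at, **changes):
    """Save the previous values to the audit table, then apply the update."""
    if not changes or not set(changes) <= ALLOWED_COLUMNS:
        raise ValueError("unknown or missing column")
    assignments = ", ".join(f"{column} = ?" for column in changes)
    with conn:  # both statements commit together, or neither does
        conn.execute(
            "INSERT INTO properties_audit (RecordID, changed_at, value1, value2) "
            "SELECT ID, ?, value1, value2 FROM properties WHERE ID = ?",
            (changed_at, record_id),
        )
        conn.execute(
            f"UPDATE properties SET {assignments} WHERE ID = ?",
            (*changes.values(), record_id),
        )

update_property(1, "2012-01-01 10:00:00", value1="Bob")
update_property(1, "2012-01-02 11:00:00", value2="120")

current = conn.execute("SELECT * FROM properties").fetchall()
previous = conn.execute(
    "SELECT RecordID, value1, value2 FROM properties_audit ORDER BY ID"
).fetchall()
print(current, previous)
```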
I think a simpler schema would be
table_name, field_name, value, time, userId
No need to save current and previous values in the audit tables. When you make a change to any of the fields you just have to add a row in the audit table with the changed value. This way you can always sort the audit table on time and know what was the previous value in the field prior to your change.
I started an ASP.NET project with Entity Framework 4 for my DAL, using SQL Server 2008. In my database, I have a table Users that should have many rows (5.000.000 for example).
Initially I had my Users table designed like this:
Id uniqueidentifier
Name nvarchar(128)
Password nvarchar(128)
Email nvarchar(128)
Role_Id int
Status_Id int
I've modified my table, and added a MarkedForDeletion column:
Id uniqueidentifier
Name nvarchar(128)
Password nvarchar(128)
Email nvarchar(128)
Role_Id int
Status_Id int
MarkedForDeletion bit
Should I delete every entity each time, or use the MarkedForDeletion attribute? The latter means that I need to update the value and, at some moment in time, delete all users with the value set to true with a stored procedure or something similar.
Wouldn't the update of the MarkedForDeletion attribute cost the same as a delete operation?
Depending on the requirements/needs/future needs of your system, consider moving your 'deleted' entities over to a new table. Setup an 'audit' table to hold those that are deleted. Consider the case where someone wants something 'restored'.
To your question on performance: would the update be the same cost as a delete? No. The update would be a much lighter operation, especially if you had an index on the PK (errrr, that's a guid, not an int). The point being that an update to a bit field is much less expensive. A (mass) delete would force a reshuffle of the data. Perhaps that job belongs during a downtime or a low-volume period.
Regarding performance: benchmark it to see what happens! Given your table with 5 million rows, it'd be nice to see how your SQL Server performs, in its current state of indexes, paging, etc, with both scenarios. Make a backup of your database, and restore into a new database. Here you can sandbox as you like. Run & time the scenarios:
mass delete vs.
update a bit or smalldatetime field vs.
move to an audit table
In terms of books, try:
this answer re: books
a recommendation for Adam Mechanic's book
another question on database books.
This may depend on what you want to do with the information. For instance, you may want to mark a user for deletion but not delete all his child records (say, something like forum posts). In this case you should mark for deletion or use a deleted-date field. If you do this, create a view to use for all active users (called ActiveUsers), then insist that the view be used in any query for login or wherever you only want to see the active users. That will help prevent query errors when you forget to exclude the inactive ones. If your system is already live, do not make this change without going through and adjusting all queries that need to use the new view.
Another reason to use the second version is to prevent slowdowns when deleting large numbers of child records. They no longer need to be deleted if you use a deleted flag. This can help performance because fewer resources are needed. Additionally, you can flag records for deletion and then delete them in the middle of the night (or move them to a history table) to keep the main tables smaller but still not affect performance during peak hours.
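Something along these lines, sketched with Python's sqlite3 instead of SQL Server. The DeletedOn column, the Posts table and the soft_delete/purge helpers are assumptions for the example; purge stands in for the off-peak cleanup job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL, DeletedOn TEXT);
    CREATE TABLE Posts (Id INTEGER PRIMARY KEY,
                        UserId INTEGER REFERENCES Users(Id), Body TEXT);
    -- Queries for login etc. go through the view, never the base table.
    CREATE VIEW ActiveUsers AS SELECT Id, Name FROM Users WHERE DeletedOn IS NULL;
    INSERT INTO Users (Id, Name) VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO Posts VALUES (1, 2, 'hello');
""")

def soft_delete(user_id, when):
    """Cheap single-row update during peak hours; child rows are left alone."""
    conn.execute("UPDATE Users SET DeletedOn = ? WHERE Id = ?", (when, user_id))

def purge(cutoff):
    """Off-peak job: physically remove users flagged before the cutoff."""
    conn.execute(
        "DELETE FROM Posts WHERE UserId IN "
        "(SELECT Id FROM Users WHERE DeletedOn < ?)", (cutoff,))
    conn.execute("DELETE FROM Users WHERE DeletedOn < ?", (cutoff,))

soft_delete(2, "2020-01-01")
active = conn.execute("SELECT Id, Name FROM ActiveUsers").fetchall()
posts_before_purge = conn.execute("SELECT COUNT(*) FROM Posts").fetchone()[0]
purge("2020-02-01")
posts_after_purge = conn.execute("SELECT COUNT(*) FROM Posts").fetchone()[0]
print(active, posts_before_purge, posts_after_purge)
```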
I have a question regarding the two additional columns (timeCreated, timeLastUpdated) for each record that we see in many solutions. My question: Is there a better alternative?
Scenario: You have a huge DB (in terms of tables, not records), and then the customer comes and asks you to add "timestamping" to 80% of your tables.
I believe this can be accomplished by using a separate table (TIMESTAMPS). This table would have, in addition to the obvious timestamp column, the table name and the primary key for the table being updated. (I'm assuming here that you use an int as primary key for most of your tables, but the table name would most likely have to be a string).
To picture this suppose this basic scenario. We would have two tables:
PAYMENT :- (your usual records)
TIMESTAMP :- {current timestamp} + {TABLE_UPDATED, id_of_entry_updated, timestamp_type}
Note that in this design you don't need those two "extra" columns in your native payment object (which, by the way, might make it thru your ORM solution) because you are now indexing by TABLE_UPDATED and id_of_entry_updated. In addition, timestamp_type will tell you if the entry is for insertion (e.g "1"), update (e.g "2"), and anything else you may want to add, like "deletion".
I would like to know what you think about this design. I'm most interested in best practices, what works and scales over time. References, links, blog entries are more than welcome. I know of at least one patent (pending) that tries to address this problem, but it seems details are not public at this time.
Cheers,
Eduardo
While you're at it, also record the user who made the change.
The flaw with the separate-table design (in addition to the join performance highlighted by others) is that it makes the assumption that every table has an identity column for the key. That's not always true.
If you use SQL Server, the new 2008 version supports something they call Change Data Capture that should take away a lot of the pain you're talking about. I think Oracle may have something similar as well.
Update: Apparently Oracle calls it the same thing as SQL Server. Or rather, SQL Server calls it the same thing as Oracle, since Oracle's implementation came first ;)
http://www.oracle.com/technology/oramag/oracle/03-nov/o63tech_bi.html
I have used a design where each table to be audited had two tables:
create table NAME (
name_id int,
first_name varchar,
last_name varchar
-- any other table/column constraints
);
create table NAME_AUDIT (
name_audit_id int,
name_id int,
first_name varchar,
last_name varchar,
update_type char(1), -- 'U', 'D', 'C'
update_date datetime
-- no table constraints really, outside of name_audit_id as PK
);
A database trigger is created that populates NAME_AUDIT every time anything is done to NAME. This way you have a record of every single change made to the table, and when. The application has no real knowledge of this, since it is maintained by a database trigger.
It works reasonably well and doesn't require any changes to application code to implement.
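For illustration, the triggers might look roughly like this. The sketch uses SQLite trigger syntax through Python's sqlite3, so the syntax differs from whatever database you are on; the trigger names and sample data are made up, and recording the new values on 'C'/'U' and the old values on 'D' is a choice made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NAME (name_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE NAME_AUDIT (
        name_audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
        name_id       INTEGER,
        first_name    TEXT,
        last_name     TEXT,
        update_type   TEXT,                          -- 'U', 'D', 'C'
        update_date   TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER name_after_insert AFTER INSERT ON NAME BEGIN
        INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type)
        VALUES (NEW.name_id, NEW.first_name, NEW.last_name, 'C');
    END;
    CREATE TRIGGER name_after_update AFTER UPDATE ON NAME BEGIN
        INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type)
        VALUES (NEW.name_id, NEW.first_name, NEW.last_name, 'U');
    END;
    CREATE TRIGGER name_after_delete AFTER DELETE ON NAME BEGIN
        INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type)
        VALUES (OLD.name_id, OLD.first_name, OLD.last_name, 'D');
    END;
""")

# The application just works with NAME; it never mentions NAME_AUDIT.
conn.execute("INSERT INTO NAME VALUES (1, 'Ada', 'Byron')")
conn.execute("UPDATE NAME SET last_name = 'Lovelace' WHERE name_id = 1")
conn.execute("DELETE FROM NAME WHERE name_id = 1")

trail = conn.execute(
    "SELECT update_type, last_name FROM NAME_AUDIT ORDER BY name_audit_id"
).fetchall()
print(trail)
```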
I think I prefer adding the timestamps to the individual tables. Joining on your timestamp table on a composite key -- one of which is a string -- is going to be slower and if you have a large amount of data it will eventually be a real problem.
Also, a lot of the time when you are looking at timestamps, it's when you're debugging a problem in your application and you'll want the data right there, rather than always having to join against the other table.
One nightmare with your design is that every single insert, update or delete would have to hit that table. This can cause major performance and locking issues. It is a bad idea to generalize a table like that (not just for timestamps). It would also be a nightmare to get the data out of.
If your code would break at the GUI level from adding fields you don't want the user to see, the code behind your GUI is written incorrectly: it should specify only the minimum set of columns it needs and never use select *.
The advantage of the method you suggest is that it gives you the option of adding other fields to your TIMESTAMP table, like tracking the user who made the change. You can also track edits to sensitive fields, for example who repriced this contract?
Logging record changes in a separate file means you can show multiple changes to a record, like:
mm/dd/yy hh:mm:ss Added by XXX
mm/dd/yy hh:mm:ss Field PRICE Changed by XXX,
mm/dd/yy hh:mm:ss Record deleted by XXX
One disadvantage is the extra code that will perform inserts into your TIMESTAMPS table to reflect changes in your main tables.
If you set up the time-stamp stuff to run off of triggers, then any action that can set off a trigger (Reads?) can be logged. Also there might be some locking advantages.
(Take all that with a grain of salt, I'm no DBA or SQL guru)
Yes, I like that design, and use it with some systems. Usually, some variant of:
LogID int
Action varchar(1) -- ADDED (A)/UPDATED (U)/DELETED (D)
UserID varchar(20) -- UserID of culprit :)
Timestamp datetime -- Date/Time
TableName varchar(50) -- Table Name or Stored Procedure ran
UniqueID int -- Unique ID of record acted upon
Notes varchar(1000) -- Other notes Stored Procedure or Application may provide
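A minimal sketch of how a log table like this can be written to and read back as a per-record history, using Python's sqlite3. The ChangeLog table name, the log_change and history helpers, and the sample entries are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ChangeLog (
        LogID     INTEGER PRIMARY KEY AUTOINCREMENT,
        Action    TEXT NOT NULL CHECK (Action IN ('A', 'U', 'D')),
        UserID    TEXT NOT NULL,
        Timestamp TEXT NOT NULL,
        TableName TEXT NOT NULL,
        UniqueID  INTEGER NOT NULL,
        Notes     TEXT
    )
""")

def log_change(action, user_id, timestamp, table_name, unique_id, notes=None):
    conn.execute(
        "INSERT INTO ChangeLog (Action, UserID, Timestamp, TableName, UniqueID, Notes) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (action, user_id, timestamp, table_name, unique_id, notes),
    )

LABELS = {"A": "Added", "U": "Updated", "D": "Deleted"}

def history(table_name, unique_id):
    """Return the changes to one record as display lines, oldest first."""
    rows = conn.execute(
        "SELECT Timestamp, Action, UserID FROM ChangeLog "
        "WHERE TableName = ? AND UniqueID = ? ORDER BY LogID",
        (table_name, unique_id),
    )
    return [f"{ts} {LABELS[action]} by {user}" for ts, action, user in rows]

log_change("A", "gus", "2012-01-01 09:00:00", "Contract", 5)
log_change("U", "tim", "2012-01-02 10:30:00", "Contract", 5, notes="repriced")
log_change("A", "liz", "2012-01-03 08:15:00", "Contract", 6)
lines = history("Contract", 5)
print("\n".join(lines))
```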
I think the extra joins you will have to perform to get the timestamps will be a slight performance hit and a pain in the neck. Other than that I see no problem.
We did exactly what you did. It is great for the object model and for the ability to add new stamps and different types of stamps to our model with minimal code. We were also tracking the user that made the change, and a lot of our logic was heavily based on these stamps. It worked very well.
One drawback is reporting, and/or showing a lot of different stamps on one screen. If you are doing it the way we did it, it causes a lot of joins. Also, back-ending changes was a pain.
Our solution is to maintain a "Transaction" table, in addition to our "Session" table. UPDATE, INSERT and DELETE instructions are all managed through a "Transaction" object, and each of these SQL instructions is stored in the "Transaction" table once it has been successfully executed on the database. This "Transaction" table has other fields such as transactiontType (I for INSERT, D for DELETE, U for UPDATE), transactionDateTime, etc., and a foreign key "sessionId", which finally tells us who sent the instruction. It is even possible, through some code, to identify who did what and when (Gus created the record on Monday, Tim changed the Unit Price on Tuesday, Liz added an extra discount on Thursday, etc.).
Pros for this solution are:
you're able to tell "what who and when", and to show it to your users! (you'll need some code to analyse SQL statements)
if your data is replicated, and replication fails, you can rebuild your database through this table
Cons are
100 000 data updates per month mean 100 000 records in Tbl_Transaction
Finally, this table tends to be 99% of your database volume
Our choice: all records older than 90 days are automatically deleted every morning
Philippe,
Don't simply delete those older than 90 days, move them first to a separate DB or write them to text file, do something to preserve them, just move them out of the main production DB.
If it ever comes down to it, most often it is a case of "he with the most documentation wins"!
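A rough sketch of "move first, then delete", using Python's sqlite3. An archive table in the same database stands in for the separate DB or text file; the column layout of Tbl_Transaction, the Tbl_Transaction_Archive name and the archive_old helper are assumptions for the example.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tbl_Transaction (
        id INTEGER PRIMARY KEY,
        transactionDateTime TEXT NOT NULL,
        sqlText TEXT NOT NULL
    );
    -- Stand-in for the separate archive database.
    CREATE TABLE Tbl_Transaction_Archive (
        id INTEGER PRIMARY KEY,
        transactionDateTime TEXT NOT NULL,
        sqlText TEXT NOT NULL
    );
    INSERT INTO Tbl_Transaction VALUES
        (1, '2012-01-01 00:00:00', 'UPDATE ...'),
        (2, '2012-06-01 00:00:00', 'INSERT ...');
""")

def archive_old(now, keep_days=90):
    """Copy rows older than keep_days to the archive, then delete them."""
    cutoff = (now - timedelta(days=keep_days)).isoformat(sep=" ")
    with conn:  # copy and delete succeed or fail together
        conn.execute(
            "INSERT INTO Tbl_Transaction_Archive "
            "SELECT * FROM Tbl_Transaction WHERE transactionDateTime < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM Tbl_Transaction WHERE transactionDateTime < ?",
            (cutoff,),
        ).rowcount
    return moved

moved = archive_old(datetime(2012, 6, 15))
print(moved)
```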