Database design question. BIT column for deletions - sql-server

Part of my table design is to include an IsDeleted BIT column that is set to 1 whenever a user deletes a record. Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition.
I read in a previous question (I cannot for the love of God re-find that post and reference it) that this might not be the best design and an 'Audit Trail' table might be better.
How are you guys dealing with this problem?
Update
I'm on SQL Server. Solutions for other DBs are welcome; they're not as useful for me, but maybe for other people.
Update2
Just to encapsulate what everyone has said so far, there seem to be basically three ways to deal with this:
Leave it as it is
Create an audit table to keep track of all the changes
Use of views with WHERE IsDeleted = 0

Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition.
This is not a really good way to do it; as you have probably noticed, it is quite error-prone.
You could create a VIEW which is simply
CREATE VIEW myview AS SELECT * FROM yourtable WHERE IsDeleted = 0;
Then you just use myview instead of yourtable and you don't have to think about this damn column in SELECTs.
Or, you could move deleted records to a separate "archive" table, which, depending on the proportion of deleted versus active records, might make your "active" table a lot smaller and better cached in RAM, i.e. faster.
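A minimal T-SQL sketch of that archive move, assuming a hypothetical yourtable_archive with the same columns (id, col1, col2 are placeholders, not from the question):

BEGIN TRANSACTION;

-- Copy the soft-deleted rows over...
INSERT INTO yourtable_archive (id, col1, col2)
SELECT id, col1, col2
FROM yourtable
WHERE IsDeleted = 1;

-- ...then remove them from the active table
DELETE FROM yourtable
WHERE IsDeleted = 1;

COMMIT TRANSACTION;

Running both statements in one transaction avoids a window where a row exists in both tables or in neither.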

If you have to have this kind of Deleted Bit column, then you really should consider setting up some VIEWs with the WHERE clause in it, and use those rather than the underlying tables. Much less error prone.
For example, if you have this view:
CREATE VIEW [Current Product List] AS
SELECT ProductID, ProductName
FROM Products
WHERE Discontinued = 0
Then someone who wants to see current products can simply write:
SELECT * FROM [Current Product List]
This is much less error prone than writing:
SELECT ProductID, ProductName
FROM Products
WHERE Discontinued = 0
As you say, people will forget that WHERE clause, and get confusing and incorrect results.
P.S. The example SQL comes from Microsoft's Northwind database. Normally I would recommend NOT using spaces in column and table names.

We're actively using a "Deleted" column in our enterprise software. It is, however, a source of constant errors when someone forgets to add "WHERE Deleted = 0" to an SQL query.
Not sure what is meant by "Audit Trail". You may wish to have a table to track all deleted records. Or there may be an option of moving the deleted content to paired tables (like Customer_Deleted) to move passive content out of the main tables, minimizing their size and optimizing performance.

A while ago there was some blog uproar over this issue; Ayende and Udi Dahan both posted on it.

Nai, this is totally up to you.
Do you need to be able to see who has deleted / modified / inserted what and when? If so, you should design the tables for this and adjust your procs to write these values when they are called.
If you don't need an audit trail, don't waste time with one. Just do as you are with IsDeleted.
Personally, I flag things right now, as an audit trail wasn't specified in my spec, but that said, I don't like to actually delete things. Hence, I chose to flag it. I'm not going to waste a client's time writing something they didn't request. I won't mess about with other tables because that's another thing for me to think about. I'd just make sure my indexes were up to the job.
Ask your manager or client. Plan out how long the audit trail would take so they can cost it and let them make the decision for you ;)

Udi Dahan said this:
Model the task, not the data
Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).
As we broaden our perspective to more parts of the system, we see this same pattern repeating:
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.
Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.
Jobs aren’t deleted – they’re filled (or their requisition is revoked).
In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.

If you have an Oracle DB, then you can use an audit trail for auditing. Check out the AUDIT VAULT tool from OTN. It even supports SQL Server.

Views (or stored procs) to get at the underlying table data are the best way. However, if you have the problem with "too many cooks in the kitchen" like we do (too many people have rights to the data and may just use the table without knowing enough to use the view/proc) you should try using another table.
We have a complete mimic of the base table with a few extra columns for tracking. So Employee table has an EmployeeDeleted table with the same schema but extra columns for when it was deleted and who deleted it and sometimes even the reason for deletion. You can even get fancy and have triggers do the insertion directly instead of going through applications/procs.
Biggest Advantage: no flag to worry about during selects
Biggest Disadvantage: any schema changes to the base table also have to be made on the "deleted" table
Best for: situations where for whatever reason (usually political with us) many not-as-experienced people have rights to the data but still expect it to be accurate without having to understand flags or schemas, etc.
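The trigger idea mentioned above might look roughly like this in T-SQL (a sketch; it assumes Employee has just EmployeeID and Name, with EmployeeDeleted adding the tracking columns):

CREATE TRIGGER trg_Employee_Delete
ON Employee
AFTER DELETE
AS
BEGIN
    -- "deleted" is the built-in pseudo-table holding the rows just removed
    INSERT INTO EmployeeDeleted (EmployeeID, Name, DeletedOn, DeletedBy)
    SELECT EmployeeID, Name, GETDATE(), SUSER_SNAME()
    FROM deleted;
END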

I've used soft deletes before on a number of applications I've worked on, and overall it's worked out quite well. Yes, there is the issue of always having to remember to add AND IsActive = 1 to all of your SELECT queries, but really that's not so bad. You can create views if you don't want to have to remember to always do that.
The reason we've done this is because we had very specific business needs to be able to report on records that have been deleted. The reporting needs varied widely - sometimes they'd need to see just the active records, or just the inactive records, or sometimes a mix of both - so pushing all the deleted records into an audit table wasn't a very good option.
So, depending on your particular business needs, I think this approach is certainly a viable option.


Postgres: How to clear transaction ID for anonymity and data reduction

We're running an evaluation platform where users can comment on certain things. A key feature is that people can comment only once, and every comment is made in anonymity.
We're using Postgres for all our data. We want to save a flag in the database that a user created a comment (so they cannot comment again). In a separate table but within the same transaction, we want to save the comment itself without any link to the user.
However, Postgres saves the transaction ID of every tuple inserted into the database (the xmin system column). So now there's a link between the user and their comment, which we have to avoid!
Possible (Non)Solutions
Vacuuming alone does not help as it does not clear the transaction ID. See the "Note" box in the "24.1.5. Preventing Transaction ID Wraparound Failures" section in the postgres docs.
Putting those inserts in different transactions doesn't really solve anything, since transaction IDs are consecutive.
We could aggregate comments from multiple users to one large text in the database with some separators, but since old versions of this large text would be kept by postgres at least until the next vacuum, that doesn't seem like a full solution. Also, we'd still have the order of when the user added their comment, which would be nice to not save as well.
Re-writing all the tuples in those tables periodically (by a dummy UPDATE to all of them), followed by a vacuum would probably erase the "insert history" sufficiently, but that too seems like a crude hack.
Is there any other way within postgres to make it impossible to reconstruct the insertion history of a table?
Perhaps you could use something like dblink or postgres_fdw to write to tables using a remote connection (either to the current database or another database), and thereby separate xmin values, even though you as a user think you are doing it all in the "same transaction."
Regarding the concerns about tracking via reverse-engineering sequential xmin values, since dblink is asynchronous, this issue may become moot at scale, when many users are simultaneously adding comments to the system. This might not work if you need to be able to rollback after encountering an error—it really depends on how important it is for you to confine the operations into one transaction.
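A sketch of the dblink variant (the connection string, table, and value are illustrative, and the extension must first be installed with CREATE EXTENSION dblink):

-- The remote INSERT runs in its own transaction, so the comment's xmin
-- is not tied to the local transaction that sets the user's flag.
SELECT dblink_exec(
    'dbname=evaluation user=app',
    'INSERT INTO comments (body) VALUES (''Looks good to me'')'
);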
I don't think there is a problem.
In your comment you write that you keep a flag with the user (however exactly you store it) that keeps track of which posting the user commented on. To keep that information private, you have to keep that flag private so that nobody except the user themselves can read it.
If no other user can see that information, then no other user can see the xmin on the corresponding table entries, and nobody can correlate it with the xmin on the comment, so the problem is not there.
The difficult part is how you want to keep the information private which postings a user commented on. I see two ways:
Don't use database techniques to do it, but write the application so that it hides that information from the users.
Use PostgreSQL Row Level Security to do it.
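A minimal sketch of the second option, assuming a hypothetical user_comment_flags table whose username column matches the database user name:

-- Once RLS is enabled, users only see rows their policy allows
ALTER TABLE user_comment_flags ENABLE ROW LEVEL SECURITY;

CREATE POLICY own_flags ON user_comment_flags
    USING (username = current_user);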
There is no way you can keep the information from a superuser. Don't even try.
You could store the users with their flags and the comments on different database clusters (and use distributed transactions), then the xmins would be unrelated.
Make sure to disable track_commit_timestamp.
To make it impossible to correlate the transactions in the two databases, you could issue random calls to
SELECT txid_current();
each of which does nothing but increment the transaction counter.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by the user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the maximum number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of columns (managed by the application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then run some kind of database update batch job, so that at least all the properties would be declared with the right datatype. Also, you could lose the entity_id/key thing.
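For instance, such a batch job could simply run pre-declared, typed column additions (WARRANTY_MONTHS is an invented example):

-- A controlled schema migration, instead of the application
-- inventing columns or tables at runtime
ALTER TABLE ENTITY ADD WARRANTY_MONTHS INTEGER;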
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know the maximum number of custom properties a user might have, I'd say you'd better set the table's number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operation databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design, as you could end up bloating the database with tables. The only pro of the second method is that you are not traversing all of the redundant rows that do not apply to the entity selected. However, using indexes on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes, and let the database engine determine the best methods for selecting the data rather than separating each entity's properties into a new table.
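For example, a composite index on the lookup columns (using the schema from the question) lets the engine seek straight to one entity's properties instead of scanning everything:

CREATE INDEX IX_ENTITY_PROPERTIES_LOOKUP
    ON ENTITY_PROPERTIES (ENTITY_ID, PROPERTY_KEY);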
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound WHERE clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
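To make that concrete, the same compound filter looks roughly like this in each design (a sketch only; the property keys and values are invented):

-- EAV design: one self-join on ENTITY_PROPERTIES per property in the filter
SELECT e.ENTITY_ID
FROM ENTITY e
JOIN ENTITY_PROPERTIES p1
  ON p1.ENTITY_ID = e.ENTITY_ID
 AND p1.PROPERTY_KEY = 'custom_1' AND p1.PROPERTY_VALUE = 'true'
JOIN ENTITY_PROPERTIES p2
  ON p2.ENTITY_ID = e.ENTITY_ID
 AND p2.PROPERTY_KEY = 'custom_2' AND p2.PROPERTY_VALUE = 'banana';

-- Column-per-property design: plain predicates on one wide table
SELECT e.ENTITY_ID
FROM ENTITY e
JOIN ENTITY_PROPERTIES_1 p ON p.ENTITY_ID_1 = e.ENTITY_ID
WHERE p.CUSTOM_PROPERTY_1 = 'true'
  AND p.CUSTOM_PROPERTY_2 = 'banana';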
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of introducing bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Names of businesses keyed differently by different people

I have this table
tblStore
with these fields
storeID (autonumber)
storeName
locationOrBranch
and this table
tblPurchased
with these fields
purchasedID
storeID (foreign key)
itemDesc
In the case of stores that have more than one location, there is a problem when two people inadvertently key the same store location differently. For example, take Harrisburg Chevron. On some of its receipts it calls itself Harrisburg Chevron; some just say Chevron at the top, and under that, Harrisburg. One person may key it into tblStore as storeName Chevron, locationOrBranch Harrisburg. Person 2 may key it as storeName Harrisburg Chevron, locationOrBranch Harrisburg. What makes this bad is that the business's name is Harrisburg Chevron. It seems hard to make a rule (that would understandably cover all future opportunities for this error) to prevent people from doing this in the future.
Question 1) I'm thinking as the instances are found, an update query to change all records from one way to the other is the best way to fix it. Is this right?
Question 2) What would be the best way to have originally set up the db to have avoided this?
Question 3) What can I do to make future after-the-fact corrections easier when this happens?
Thanks.
edit: I do understand that better business practices are the ideal prevention, but for question 2 I'm looking for any tips or tricks that people use that could help. And question 1 and 3 are important to me too.
This is not a database design issue.
This is an issue with the processes around using the database design.
The real question I have is: why are users entering stores ad hoc? I can think of scenarios, but without knowing your situation it is hard to guess.
The normal solution is that the tblStore table is a lookup table only. Normally users only have access to stores that have already been entered.
Then there is a controlled process to maintain the tblStore table in a consistent manner. Only a few users would have access to this process.
Of course as I alluded to above this is not always possible, so you may need a different solution.
UPDATE:
Question #1: An update script is the best approach. The best way to do this is to have a copy of the database if possible, or a close copy if not, and test the script against this data. Once you have ensured that the script runs correctly, then you can run it against the real data.
If you have transactional integrity, you should use it. Use "begin" before running the script, and if the number of records affected is what you expect (and any other tests you devise, perhaps also scripted, pass), then you can "commit".
Do not type SQL directly against a live DB.
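A sketch of that pattern, assuming a database that supports transactions (SQL Server syntax here) and using the tables from the question; the expected row count is of course illustrative:

BEGIN TRANSACTION;

UPDATE tblStore
SET storeName = 'Harrisburg Chevron'
WHERE storeName = 'Chevron'
  AND locationOrBranch = 'Harrisburg';

-- Commit only if the script touched exactly the rows you expected
IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;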
Question #3: I suggest your first line of attack is to create processes around the creation of new stores, but this may not be within your ambit.
The second is possibly to get proactive and identify and enter new stores (if this is the problem) before the users in the field need to do so. I don't know if this works inside your scenario.
Lastly, if you had a script that merged "store1" into "store2", you could standardise on that as a way of reducing time and errors. You could even build that into an admin-only screen that automated merging stores.
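Such a merge script could be as simple as repointing the child rows and then removing the duplicate (SQL Server syntax; the IDs are made up):

DECLARE @keepStoreID int = 42;       -- the store record to standardise on
DECLARE @duplicateStoreID int = 57;  -- the mis-keyed duplicate

BEGIN TRANSACTION;

-- Repoint purchases from the duplicate store to the kept one
UPDATE tblPurchased
SET storeID = @keepStoreID
WHERE storeID = @duplicateStoreID;

-- Then remove the now-unreferenced duplicate
DELETE FROM tblStore
WHERE storeID = @duplicateStoreID;

COMMIT TRANSACTION;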
That is all I can think of off the top of my head.

Database: To delete or not to delete records

I don't think I am the only person wondering about this. What do you usually practice about database behavior? Do you prefer to delete a record from the database physically? Or is it better to just flag the record with a "deleted" flag or a boolean column to denote the record is active or inactive?
It definitely depends on the actual content of your database. If you're using it to store session information, then by all means wipe it immediately when the session expires (or is closed); you don't want that garbage lying around, as it cannot really be used again for any practical purpose.
Basically, what you need to ask yourself is: might I need to restore this information? Like deleted questions on SO, they should definitely just be marked 'deleted', as we're actively allowing an undelete. We also have the option of displaying it to select users, without much extra work.
If you're not actively seeking to fully restore the data, but you'd still like to keep it around for monitoring (or similar) purposes, I would suggest that you figure out (to the extent possible, of course) an aggregation scheme and shove that off to another table. This will keep your primary table clean of 'deleted' data and keep your secondary table optimized for monitoring purposes (or whatever you had in mind).
For temporal data, see: http://talentedmonkeys.wordpress.com/2010/05/15/temporal-data-in-a-relational-database/
Pros of using a delete flag:
You can get the data back later if you need it,
The delete operation (updating the flag) is probably quicker than really deleting the row
Cons of using a delete flag:
It is very easy to miss AND DeletedFlag = 'N' somewhere in your SQL
Slower for the database to find the rows that you are interested in amongst all the crap
Eventually, you'll probably want to really delete it anyway (assuming your system is successful; what about when that record is 10 years old and was "deleted" 4 minutes after it was originally created?)
It can make it impossible to use a natural key. You may have one or more deleted rows with the natural key and a real row wanting to use that same natural key.
There may be legal/compliance reasons why you are meant to actually delete data.
As a complement to all the other posts:
If you plan to mark the record, it's good to consider making a view for active records. This would save you from writing, or forgetting, the flag in your SQL query. You might consider a view for non-active records too, if you think that would also serve some purpose.
I am glad to have found this thread. I too was wondering what people thought about this issue. I have implemented the 'marked as deleted' approach for about 15 years on many systems. Whenever a user would call to say something was accidentally deleted, it was certainly a lot easier to mark it un-deleted than to recreate it or restore from a backup.
We are using PostgreSQL and Ruby on Rails. It looks like we could do this in one of two ways: modify Rails, or add an ON DELETE trigger with a PL/pgSQL function that marks the row as deleted instead of deleting it. I am leaning toward the latter.
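A rough sketch of that trigger approach (the table and column names are illustrative; it assumes a boolean deleted column):

CREATE FUNCTION soft_delete() RETURNS trigger AS $$
BEGIN
    -- Flag the row instead of removing it
    UPDATE items SET deleted = true WHERE id = OLD.id;
    RETURN NULL;  -- returning NULL suppresses the actual DELETE
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_soft_delete
    BEFORE DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE soft_delete();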
As for performance hits, it will be interesting to see the results of EXPLAIN ANALYZE on large tables with few deleted items as well as with many deleted items.
In systems used over time, I have found that new users tend to do silly things like delete records accidentally. When people are new in a position, they have all the access rights of the person previously in that position, but zero experience. Accidentally deleting something and being able to quickly recover it gets everyone back to work quickly.
But as someone said, sometimes you may need that particular key back for some reason; at that point you would need to really delete it and then re-create the record (or un-delete it and modify the record).
I mark them as deleted, and don't really delete. However every once in a while I sweep out all the junk and archive it, so it doesn't kill performance.
There are also legal issues either way if personal data is involved. I think it greatly depends on where you are (or where the database is), and what the terms of use are.
In some cases people can ask to be removed from your system, in which case a hard delete is needed (or at least clearing out all of the personal information).
I would check with your legal department before you adopt a strategy either way if personal information is involved.
If you are concerned about "dormant" records slowing down your database access, you may want to move those rows into another table acting as an "archive" table.
For user-entered/managed data I've used the flag method you describe and given the user an "empty the trash" interface to actually delete items if they choose to.
I have a database with lots of dependencies, so I cannot delete some records because others still depend on the data. This is what I usually do: I try to delete the data; if it works, I know it didn't have any dependencies and didn't matter. If it doesn't, I catch the error and flag the record as inactive:
try
{
    _context.SomeTable.Remove(someEntity);
    await _context.SaveChangesAsync();
}
// SQL Server error 547 is a foreign key constraint violation
catch (DbUpdateException ex) when (ex.InnerException is SqlException sqlEx && sqlEx.Number == 547)
{
    // The failed save leaves the entity tracked as Deleted, so reset
    // its state before flagging it as inactive and saving again
    _context.Entry(someEntity).State = EntityState.Unchanged;
    someEntity.Active = false;
    await _context.SaveChangesAsync();
}

Maintain history in a database

I am designing a database that must maintain a history of employee salary and movements within the organization. Basically, my design has 3 tables (I mean, there are more tables, but for this question I'll mention 3, so bear with me): an Employee table (containing the most current salary, position data, etc.), a SalaryHistory table (salary, date, reason, etc.) and a MovementHistory table (title, dept., comments). I'll be using LINQ to SQL, so what I was thinking is that every time employee data is updated, the old values will be copied to their respective history tables. Is this a good approach? Should I just do it using LINQ to SQL or triggers? Thanks for any help, suggestion or idea.
Have a look at http://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture .
Basically, the article suggests that you have the following columns in the tables you need to track history for:
* DateCreated – the actual date on which the given row was inserted.
* DateEffective – the date on which the given row became effective.
* DateEnd – the date on which the given row ceased to be effective.
* DateReplaced – the date on which the given row was replaced by another row.
* OperatorCode – the unique identifier of the person (or system) that created the row.
DateEffective and DateEnd together tell you the time for which the row was valid (or the time for which an employee was in a department, or the time for which he earned a particular salary).
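Applied to the question's SalaryHistory table, that might look something like this (a sketch; the column types and names are assumed, not from the article):

CREATE TABLE SalaryHistory (
    EmployeeID    int           NOT NULL,
    Salary        decimal(12,2) NOT NULL,
    Reason        varchar(200)  NULL,
    DateCreated   datetime      NOT NULL DEFAULT GETDATE(),
    DateEffective datetime      NOT NULL,
    DateEnd       datetime      NULL,   -- NULL = still in effect
    DateReplaced  datetime      NULL,
    OperatorCode  varchar(50)   NOT NULL
);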
It is a good idea to keep that logic internal to the database: that's basically why triggers exist. I say this carefully, however, as there are plenty of reasons to keep it external. Oftentimes - especially with a technology as easy as LINQ to SQL - it is easier to write the code externally. In my experience, more people could write that logic in C#/LINQ than could do it correctly using a trigger.
Triggers are fast - they're compiled! However, they're very easy to misuse and make your logic overcomplex to a degree that performance can degrade rapidly. Considering how simple your use case is, I would opt to use triggers, but that's me personally.
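For reference, such a trigger might look roughly like this in T-SQL (EmployeeID, Salary, and the SalaryHistory columns are assumed from the question's description):

CREATE TRIGGER trg_Employee_SalaryHistory
ON Employee
AFTER UPDATE
AS
BEGIN
    -- Copy the superseded salary into the history table when it changes;
    -- "deleted" holds the old row values, "inserted" the new ones
    INSERT INTO SalaryHistory (EmployeeID, Salary, SalaryDate, Reason)
    SELECT d.EmployeeID, d.Salary, GETDATE(), 'Superseded by update'
    FROM deleted d
    JOIN inserted i ON i.EmployeeID = d.EmployeeID
    WHERE i.Salary <> d.Salary;
END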
Triggers will likely be faster, and don't require a "middle man" to get the job done, eliminating at least one chance for errors.
Depending on your database of choice, you can just use one table, enable OIDs on it, and add two more columns, "flag" and "previous". Never update this table, only insert. Add a trigger so that when a row is added for employee #id, all records for employee #id are flagged "old" and the new row's "previous" value is set to the previous row.
I think this belongs in the database for two reasons.
First, middle tiers come and go, but databases are forever. This year Java EJBs, next year .NET, the year after that something else. The data remains, in my experience.
Second, if the database is shared at all it should not have to rely on every application that uses it to know how to maintain its data integrity. I would consider this an example of encapsulation of the database. Why force knowledge and maintenance of the history on every client?
Triggers make your front-end easier to migrate to something else and they will keep the database consistent no matter how data is inserted/updated/removed.
Besides, in your case I would write the salaries straight to the salary history; from your description I don't see a reason why you should go via an update trigger on the employee table.
