How to improve performance when deleting entities from database? - sql-server

I started an ASP.NET project with Entity Framework 4 for my DAL, using SQL Server 2008. In my database, I have a Users table that will hold many rows (5,000,000, for example).
Initially I had my Users table designed like this:
Id uniqueidentifier
Name nvarchar(128)
Password nvarchar(128)
Email nvarchar(128)
Role_Id int
Status_Id int
I've modified my table, and added a MarkedForDeletion column:
Id uniqueidentifier
Name nvarchar(128)
Password nvarchar(128)
Email nvarchar(128)
Role_Id int
Status_Id int
MarkedForDeletion bit
Should I delete each entity outright, or use the MarkedForDeletion attribute? The latter means I update the value and, at some later point in time, delete all users with the flag set to true using a stored procedure or something similar.
Wouldn't the update of the MarkedForDeletion attribute cost the same as a delete operation?

Depending on the requirements/needs/future needs of your system, consider moving your 'deleted' entities over to a new table. Set up an 'audit' table to hold those that are deleted. Consider the case where someone wants something 'restored'.
To your question on performance: would the update be the same cost as a delete? No. The update would be a much lighter operation, especially if you had an index on the PK (errrr, that's a guid, not an int). The point being that an update to a bit field is much less expensive. A (mass) delete would force a reshuffle of the data. Perhaps that job belongs during a downtime or a low-volume period.
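For concreteness, here is a minimal T-SQL sketch of the two approaches being compared, assuming the Users table from the question; the batch size, the @UserId stand-in, and the off-hours scheduling are illustrative, not prescriptive:

-- Scenario 1: soft delete - flip the flag on one row (cheap, minimal logging).
DECLARE @UserId uniqueidentifier = NEWID();   -- stand-in for the id passed by the application
UPDATE dbo.Users
SET MarkedForDeletion = 1
WHERE Id = @UserId;

-- Scenario 2: purge flagged rows later, in small batches, during a low-volume window
-- to limit lock escalation and transaction log growth.
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) FROM dbo.Users
    WHERE MarkedForDeletion = 1;

    IF @@ROWCOUNT = 0 BREAK;
END;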
Regarding performance: benchmark it to see what happens! Given your table with 5 million rows, it'd be nice to see how your SQL Server performs, in its current state of indexes, paging, etc, with both scenarios. Make a backup of your database, and restore into a new database. Here you can sandbox as you like. Run & time the scenarios:
mass delete vs.
update a bit or smalldatetime field vs.
move to an audit table
In terms of books, try:
this answer re: books
a recommendation for Adam Machanic's book
another question on database books.

This may depend on what you want to do with the information. For instance, you may want to mark a user for deletion but not delete all his child records (say something like forum posts); in this case you should mark for deletion or use a deleted-date field. If you do this, create a view of all active users (called ActiveUsers), then insist that the view be used in any query for login or wherever you only want to see the active users. That will help prevent query errors from when you forget to exclude the inactive ones. If your system is already live, do not make this change without going through and adjusting all queries that need to use the new view.
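A minimal sketch of that view, assuming the MarkedForDeletion flag from the question (swap in a DeletedDate check if you go the deleted-date route):

CREATE VIEW dbo.ActiveUsers
AS
SELECT Id, Name, Password, Email, Role_Id, Status_Id
FROM dbo.Users
WHERE MarkedForDeletion = 0;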
Another reason to use the second version is to prevent slowdowns when deleting large numbers of child records. They no longer need to be deleted if you use a deleted flag. This can help performance because fewer resources are needed. Additionally, you can flag records for deletion and then delete them in the middle of the night (or move them to a history table) to keep the main tables smaller while not affecting performance during peak hours.

Related

SQL Delete vs Update

I have seen something like this asked a number of times, but not quite in this configuration. I have a table that has a one-to-many relation.
Let's say I have a computer table and a parts table. The user enters generic info in the computer table, then selects parts that are stored in the parts table, related to the computer table via computerId. So the original write is a simple insert. Now let's say the user selects the computer again and changes the parts on the PC: adds some new ones, removes some, and updates a few. Then the user hits save. I run a simple update on the computer table, but now there is the issue of the parts table.
Would it be better to delete all the records from the parts table for that computerId and then do a clean insert of all the parts selected?
Or run some method that looks at the existing parts in the table and, where a part has been updated, updates the record; where a part no longer exists, deletes it; and then inserts the remaining parts?
Clearly the simple solution is to delete all and then insert all.
The downside of this is SQL traffic, locks, and table fragmentation.
If it is a small table with only a few concurrent users, that's fine.
In a high-volume environment I do the following:
There is no update - that is just an ignore
- delete items gone
- ignore any items not changed
- insert new items
And you can do that in one pass with two or three statements.
Or you could define a stored procedure.
Do the delete before the insert to clear space first.
You can get real fancy and use an update for delete / insert but that just gets more complex than it is worth in my mind. You would still have an insert or a delete if the item count is not the same.
delete comp_part
where compID = #compID and partID not in (....);
Insert is a little more tricky:
You can do it with a series of inserts and, if you have a PK, just let the duplicate inserts fail.
The other way is to create a #table and use it for both the delete and the insert.
This is only worth the hassle if you have a REALLY busy table.
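A hedged sketch of that temp-table variant, reusing the comp_part(compID, partID) naming from the snippet above; the computer id and part ids are illustrative:

DECLARE @compID int = 1;                                      -- the computer being saved
CREATE TABLE #newParts (partID int PRIMARY KEY);
INSERT INTO #newParts (partID) VALUES (101), (102), (105);    -- parts now selected in the UI

-- Delete parts that are no longer selected.
DELETE cp
FROM comp_part cp
WHERE cp.compID = @compID
  AND cp.partID NOT IN (SELECT partID FROM #newParts);

-- Insert parts that are newly selected; unchanged rows are simply ignored.
INSERT INTO comp_part (compID, partID)
SELECT @compID, n.partID
FROM #newParts n
WHERE NOT EXISTS (SELECT 1 FROM comp_part cp
                  WHERE cp.compID = @compID AND cp.partID = n.partID);

DROP TABLE #newParts;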
It all depends upon the business model: if you want to track the transactions, then it's not a good option to delete them. If you keep all your old transactions with your customers, it is beneficial for tracking purposes. Your CustomerID would be the primary key, and you can have another unique key, such as PartOrderID, which will be a unique value for each insert.
Hope this helps
Really you should have three tables: Product, Part, and ProductPart; the ProductPart table stores the association "this product has these parts". As far as updating goes, the simplest thing would be to delete all ProductParts for a given Product and re-insert the records you want.
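A minimal sketch of that three-table layout; table and column names are illustrative:

CREATE TABLE dbo.Product (ProductId int IDENTITY(1,1) PRIMARY KEY, Name nvarchar(100) NOT NULL);
CREATE TABLE dbo.Part    (PartId    int IDENTITY(1,1) PRIMARY KEY, Name nvarchar(100) NOT NULL);
CREATE TABLE dbo.ProductPart (
    ProductId int NOT NULL REFERENCES dbo.Product (ProductId),
    PartId    int NOT NULL REFERENCES dbo.Part (PartId),
    PRIMARY KEY (ProductId, PartId)   -- "this product has these parts"
);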

Audit fields (CreatedBy, UpdatedBy) in tables. Is it a good idea?

I was working with one product where almost every table had those columns. As developers we constantly had to join to the Users table to resolve who created a record, and it's just a mess in the code.
I'm designing a new product and thinking about this again. Does it have to be like this? Obviously, it is good to know who created a record and when. But having 300+ tables reference the same Users table doesn't seem very good.
How do you handle things like this? Should I create a CreatedBy column only on major entities where it's most likely needed in the UI, and then deal with the joining? Or should I go and put it everywhere? Or maybe have a separate "Audit" table where I store all this and look it up only on demand (not every time an entity is displayed in the UI)?
I'm just worried about the performance aspect, where every UI query will hit the Users table.
EDIT: This is going to be SQL Server 2008 R2 database
The problem with that approach is that you only know who created the row and who changed the row last. What if the last person to update the row was correcting the previous updater's mistake?
If you're interested in doing full auditing for compliance or accountability reasons, you should probably look into SQL Server Audit. You can dictate which tables you're auditing, can change those on the fly without having to mess with your schema, and you can write queries against this data specifically instead of mixing the auditing logic with your normal application query logic (never mind widening every row of the table itself). This will also allow you to audit SELECT queries, which other potential solutions (triggers, CDC, Change Tracking - all of which are either more work or not complete for true auditing purposes) won't let you do.
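A hedged sketch of setting that up, assuming an edition that supports database-level audit specifications; the audit names, the file path, the MyAppDb database, and the dbo.Users target are all placeholders for illustration:

USE master;
CREATE SERVER AUDIT UsersAudit
    TO FILE (FILEPATH = 'C:\AuditLogs\');
ALTER SERVER AUDIT UsersAudit WITH (STATE = ON);

USE MyAppDb;   -- your application database
CREATE DATABASE AUDIT SPECIFICATION UsersAuditSpec
    FOR SERVER AUDIT UsersAudit
    ADD (SELECT, INSERT, UPDATE, DELETE ON dbo.Users BY public)
    WITH (STATE = ON);

-- Query the captured events back out:
SELECT event_time, action_id, server_principal_name, statement
FROM sys.fn_get_audit_file('C:\AuditLogs\*', DEFAULT, DEFAULT);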
I know that this is an older post, but one way to avoid the lookup on the user table is to de-normalize the audit fields.
So instead of a user id in the CreatedBy field, you insert the username itself. This allows reviewing the table without the user lookup, and it also means that changes in your user table (such as deleted users) are not reflected in the audit fields.
I usually add the following to the end of a table:
IsDeleted bit default 0
CreatedBy varchar(20)
CreatedOn datetime2 default getdate()
UpdatedBy varchar(20)
UpdatedOn datetime2 default getdate()
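For example, bolting those columns onto an existing table might look like the following; dbo.Orders is just a stand-in name, and CreatedBy/UpdatedBy are left for the application or DAL to fill in:

ALTER TABLE dbo.Orders ADD
    IsDeleted bit         NOT NULL CONSTRAINT DF_Orders_IsDeleted DEFAULT 0,
    CreatedBy varchar(20) NULL,   -- filled in by the application / DAL
    CreatedOn datetime2   NOT NULL CONSTRAINT DF_Orders_CreatedOn DEFAULT GETDATE(),
    UpdatedBy varchar(20) NULL,
    UpdatedOn datetime2   NOT NULL CONSTRAINT DF_Orders_UpdatedOn DEFAULT GETDATE();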

What is the best way to keep changes history to database fields?

For example, I have a table which stores details about properties, which could have owners, a value, etc.
Is there a good design for keeping the history of every change to owner and value? I want to do this for many tables. Kind of like an audit of the table.
What I thought was keeping a single table with fields
table_name, field_name, prev_value, current_val, time, user.
But it looks kind of hacky and ugly. Is there a better design?
Thanks.
There are a few approaches
Field based
audit_field (table_name, id, field_name, field_value, datetime)
This one can capture the history of all tables and is easy to extend to new tables. No changes to the structure are necessary for new tables.
Field_value is sometimes split into multiple fields to natively support the actual field type from the original table (but only one of those fields will be filled, so the data is denormalized; a variant is to split the above table into one table for each type).
Other metadata such as field_type, user_id, user_ip, action (update, delete, insert), etc. can be useful.
The structure of such records will most likely need to be transformed to be used.
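A hedged sketch of the field-based table, with an update trigger against a hypothetical dbo.Properties(ID, Owner, Value) table to show how rows get written; every name here is illustrative, and the column names only loosely follow the outline above:

CREATE TABLE dbo.audit_field (
    audit_id    bigint IDENTITY(1,1) PRIMARY KEY,
    table_name  sysname       NOT NULL,
    id          int           NOT NULL,                       -- PK of the audited row
    field_name  sysname       NOT NULL,
    field_value nvarchar(max) NULL,                           -- old value, stored as text
    changed_at  datetime      NOT NULL DEFAULT GETDATE(),
    changed_by  sysname       NOT NULL DEFAULT SUSER_SNAME()
);
GO
CREATE TRIGGER trg_Properties_Audit ON dbo.Properties
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- One audit row per changed column; repeat the pattern for each audited column.
    INSERT INTO dbo.audit_field (table_name, id, field_name, field_value)
    SELECT 'Properties', d.ID, 'Owner', d.Owner
    FROM deleted d
    JOIN inserted i ON i.ID = d.ID
    WHERE ISNULL(d.Owner, '') <> ISNULL(i.Owner, '');
END;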
Record based
audit_table_name (timestamp, id, field_1, field_2, ..., field_n)
For each record type in the database, create a generalized table that has all the fields of the original record, plus a versioning field (additional metadata is again possible). One table for each working table is necessary. The process of creating such tables can be automated.
This approach provides you with semantically rich structure very similar to the main data structure so the tools used to analyze and process the original data can be easily used on this structure, too.
Log file
The first two approaches usually use tables which are very lightly indexed (or not indexed at all, with no referential integrity) so that the write penalty is minimized. Still, sometimes a flat log file might be preferred, though of course functionality is greatly reduced. (Basically it depends on whether you want an actual audit/log that will be analyzed by some other system, or the historical records are part of the main system.)
A different way to look at this is to time-dimension the data.
Assuming your table looks like this:
create table my_table (
my_table_id number not null primary key,
attr1 varchar2(10) not null,
attr2 number null,
constraint my_table_ak unique (attr1, attr2) );
Then if you changed it like so:
create table my_table (
my_table_id number not null,
attr1 varchar2(10) not null,
attr2 number null,
effective_date date not null,
is_deleted number(1,0) default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );
You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to intercept UPDATE activity and turn it into INSERT activity, and to change DELETE activity into updating the IS_DELETED flag.
Unreason:
You are correct that this solution is similar to record-based auditing; I read it initially as a concatenation of fields into a string, which I've also seen. My apologies.
The primary differences I see between the time-dimensioning the table and using record based auditing center around maintainability without sacrificing performance or scalability.
Maintainability: One needs to remember to change the shadow table if making a structural change to the primary table. Similarly, one needs to remember to make changes to the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger which would be against it to intercept DML.
In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, such clarity is appreciated, especially if you have to do a lot of refactoring.
Performance and Scalability: If one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table", and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers - exactly what the record-based solution would require.
The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
In short, it's a design choice; I'm not sure either's right or either's wrong.
In our projects we usually do it this way:
You have a table
properties(ID, value1, value2)
then you add table
properties_audit(ID, RecordID, timestamp or datetime, value1, value2)
ID - the id of the history record (not really required)
RecordID - points to the record in the original properties table.
When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated in properties. This can be done using triggers or in your DAL.
After that you have the latest value in properties and all the history (previous values) in properties_audit.
I think a simpler schema would be
table_name, field_name, value, time, userId
There is no need to save both current and previous values in the audit table. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table by time and know what the previous value of the field was prior to your change.

Indexing pros and cons in SQL Server 2008

I am working on a social networking site. Our team has decided to store user profiles in a denormalized manner, so our table structure is like this.
Here, attribute means one field of the user profile, e.g. FirstName, LastName, BirthDate, etc.,
and group means the name of a group of fields, e.g. Personal Details, Academic Info, Achievements, etc.
Attribute/Groups master - creates the hierarchy of groups and attributes:
Attribute_GroupId bigint
ParentId bigint
Attribute_GroupName nvarchar(1000)
ISAttribute bit
DisplayName nvarchar(1000)
DisplaySequence int
Attribute Control Info - stores which control has to be populated at run time for the attribute, as well as its validation criteria:
Attribute_ControlInfoId bigint
AttributeId bigint
ControlType nvarchar(1000)
DataType nvarchar(1000)
DefaultValue nvarchar(1000)
IsRequired bit
RegulareExpression nvarchar(1000)
And finally Attribute Values, where user-wise values for every attribute will be stored:
AttributeId bigint
IsValueOrRefId bit
Value nvarchar(MAX)
ReferenceDataId bigint
UserId bigint
Now they are saying that we'll create an index on the Attribute Values table. There is no primary key there either.
A huge amount of data is going to be stored in this table: e.g. if there are 50 million users and 30 attributes, it will store 1,500 million records. In this case, if we create an index on the table, won't insert and update statements be very slow, and won't queries be very slow as well when fetching the data for one user?
I thought of one option: instead of attribute-wise values, I could store one XML record per user.
So, can anybody please help me find the best option for this case? How should I store the data?
I cannot use a hard-coded table here, because new fields can be added by an administrator at any time, so I need a data structure where I can easily add any field to the user profile in only 1-2 steps.
Please reply if anybody has a better solution for this.
You guys need a dba!
This is one of those EAV tables that is going to bite you down the road!
Bill Karwin (his blog) put together a SQL Anti-patterns PPT
Link 1
Link 2
He offers 3 alternate solutions to EAV.
Indexing is the least of your worries...
Check out those articles which highlight just how bad that design choice is, and what potential problems you're getting yourself into if you stick to that design:
Five Simple Database Design Errors You Should Avoid
Joe Celko: Avoiding the EAV of Destruction
Bad CaRMa
It seems to be a fairly common design problem - and it seems like a good idea to programmers to solve it that way, with an attribute/value table - but it's really not a good idea from a database performance point of view.
Also:
Now they are saying that we'll create an index on the Attribute Values table. There is no primary key there either.
As some SQL gurus like to say: "If it doesn't have a primary key, it's not a table".
You definitely need to find a way to get a primary key onto your tables - if you don't have anything that you can use per se, add a column "ID" of type "INT IDENTITY(1,1)" to it and put the primary key on that column. You need a primary key! Database design, first lesson, first five minutes....
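That fix is a short pair of statements, assuming the values table is called AttributeValues; bigint is used rather than int only because the question projects 1,500 million rows:

ALTER TABLE dbo.AttributeValues
    ADD Id bigint IDENTITY(1,1) NOT NULL;

ALTER TABLE dbo.AttributeValues
    ADD CONSTRAINT PK_AttributeValues PRIMARY KEY CLUSTERED (Id);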
You need to rethink your design and come up with something more clever to store the data you need.

How do you avoid adding timestamp fields to your tables? [closed]

I have a question regarding the two additional columns (timeCreated, timeLastUpdated) for each record that we see in many solutions. My question: Is there a better alternative?
Scenario: You have a huge DB (in terms of tables, not records), and then the customer comes and asks you to add "timestamping" to 80% of your tables.
I believe this can be accomplished by using a separate table (TIMESTAMPS). This table would have, in addition to the obvious timestamp column, the table name and the primary key for the table being updated. (I'm assuming here that you use an int as primary key for most of your tables, but the table name would most likely have to be a string).
To picture this, suppose this basic scenario. We would have two tables:
PAYMENT :- (your usual records)
TIMESTAMP :- {current timestamp} + {TABLE_UPDATED, id_of_entry_updated, timestamp_type}
Note that in this design you don't need those two "extra" columns in your native payment object (which, by the way, might make it through your ORM solution) because you are now indexing by TABLE_UPDATED and id_of_entry_updated. In addition, timestamp_type will tell you if the entry is for insertion (e.g. "1"), update (e.g. "2"), or anything else you may want to add, like "deletion".
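A minimal T-SQL sketch of that TIMESTAMPS table; the types and the index are illustrative:

CREATE TABLE dbo.TIMESTAMPS (
    TABLE_UPDATED       sysname  NOT NULL,                     -- name of the table touched
    id_of_entry_updated int      NOT NULL,                     -- int PK of the affected row
    timestamp_type      tinyint  NOT NULL,                     -- 1 = insert, 2 = update, 3 = delete, ...
    stamped_at          datetime NOT NULL DEFAULT GETDATE()
);

CREATE INDEX IX_TIMESTAMPS_row
    ON dbo.TIMESTAMPS (TABLE_UPDATED, id_of_entry_updated);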
I would like to know what you think about this design. I'm most interested in best practices, and what works and scales over time. References, links, and blog entries are more than welcome. I know of at least one patent (pending) that tries to address this problem, but it seems the details are not public at this time.
Cheers,
Eduardo
While you're at it, also record the user who made the change.
The flaw with the separate-table design (in addition to the join performance highlighted by others) is that it makes the assumption that every table has an identity column for the key. That's not always true.
If you use SQL Server, the new 2008 version supports something they call Change Data Capture that should take away a lot of the pain you're talking about. I think Oracle may have something similar as well.
Update: Apparently Oracle calls it the same thing as SQL Server. Or rather, SQL Server calls it the same thing as Oracle, since Oracle's implementation came first ;)
http://www.oracle.com/technology/oramag/oracle/03-nov/o63tech_bi.html
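A hedged sketch of turning Change Data Capture on in SQL Server 2008 (Enterprise edition); the PAYMENT table name is taken from the question's example, everything else is the standard system procedure call:

-- Enable CDC on the database, then on the table to be tracked.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'PAYMENT',
    @role_name     = NULL;   -- NULL = no gating role required to read the change data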
I have used a design where each table to be audited had two tables:
create table NAME (
    name_id int,
    first_name varchar(50),
    last_name varchar(50)
    -- any other table/column constraints
)
create table NAME_AUDIT (
    name_audit_id int,
    name_id int,
    first_name varchar(50),
    last_name varchar(50),
    update_type char(1), -- 'U', 'D', 'C'
    update_date datetime
    -- no table constraints really, outside of name_audit_id as PK
)
A database trigger is created that populates NAME_AUDIT every time anything is done to NAME. This way you have a record of every single change made to the table, and when. The application has no real knowledge of this, since it is maintained by a database trigger.
It works reasonably well and doesn't require any changes to application code to implement.
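A hedged sketch of such a trigger against the tables above, assuming name_audit_id is made an identity column so the audit rows number themselves:

CREATE TRIGGER trg_NAME_audit ON NAME
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Creates and updates: capture the new row values.
    INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type, update_date)
    SELECT i.name_id, i.first_name, i.last_name,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'C' END,
           GETDATE()
    FROM inserted i;

    -- Deletes: capture the row as it was removed.
    INSERT INTO NAME_AUDIT (name_id, first_name, last_name, update_type, update_date)
    SELECT d.name_id, d.first_name, d.last_name, 'D', GETDATE()
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;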
I think I prefer adding the timestamps to the individual tables. Joining on your timestamp table on a composite key -- one of which is a string -- is going to be slower and if you have a large amount of data it will eventually be a real problem.
Also, a lot of the time when you are looking at timestamps, it's when you're debugging a problem in your application and you'll want the data right there, rather than always having to join against the other table.
One nightmare with your design is that every single insert, update or delete would have to hit that table. This can cause major performance and locking issues. It is a bad idea to generalize a table like that (not just for timestamps). It would also be a nightmare to get the data out of.
If your code would break at the GUI level from adding fields you don't want the user to see, you are writing the code for your GUI incorrectly; it should specify only the minimum number of columns you need and never select *.
The advantage of the method you suggest is that it gives you the option of adding other fields to your TIMESTAMP table, like tracking the user who made the change. You can also track edits to sensitive fields, for example who repriced this contract?
Logging record changes in a separate file means you can show multiple changes to a record, like:
mm/dd/yy hh:mm:ss Added by XXX
mm/dd/yy hh:mm:ss Field PRICE Changed by XXX,
mm/dd/yy hh:mm:ss Record deleted by XXX
One disadvantage is the extra code that will perform the inserts into your TIMESTAMPS table to reflect changes in your main tables.
If you set up the time-stamp stuff to run off of triggers, then any action that can set off a trigger (reads?) can be logged. Also there might be some locking advantages.
(Take all that with a grain of salt, I'm no DBA or SQL guru)
Yes, I like that design, and use it with some systems. Usually, some variant of:
LogID int
Action varchar(1) -- ADDED (A)/UPDATED (U)/DELETED (D)
UserID varchar(20) -- UserID of culprit :)
Timestamp datetime -- Date/Time
TableName varchar(50) -- Table Name or Stored Procedure ran
UniqueID int -- Unique ID of record acted upon
Notes varchar(1000) -- Other notes Stored Procedure or Application may provide
I think the extra joins you will have to perform to get the timestamps will be a slight performance hit and a pain in the neck. Other than that I see no problem.
We did exactly what you did. It is great for the object model and the ability to add new stamps and different types of stamps to our model with minimal code. We were also tracking the user that made the change, and a lot of our logic was heavily based on these stamps. It worked very well.
One drawback is reporting, and/or showing a lot of different stamps on one screen. Doing it the way we did caused a lot of joins. Also, back-ending changes was a pain.
Our solution is to maintain a "Transaction" table, in addition to our "Session" table. UPDATE, INSERT and DELETE instructions are all managed through a "Transaction" object, and each of these SQL instructions is stored in the "Transaction" table once it has been successfully executed on the database. This "Transaction" table has other fields such as transactionType (I for INSERT, D for DELETE, U for UPDATE), transactionDateTime, etc., and a foreign key "sessionId", telling us who sent the instruction. It is even possible, through some code, to identify who did what and when (Gus created the record on Monday, Tim changed the Unit Price on Tuesday, Liz added an extra discount on Thursday, etc.).
Pros for this solution are:
you're able to tell "what, who, and when", and to show it to your users! (you'll need some code to analyse the SQL statements)
if your data is replicated, and replication fails, you can rebuild your database through this table
Cons are
100 000 data updates per month mean 100 000 records in Tbl_Transaction
Finally, this table tends to be 99% of your database volume
Our choice: all records older than 90 days are automatically deleted every morning
Philippe,
Don't simply delete those older than 90 days; move them first to a separate DB or write them to a text file, do something to preserve them, just move them out of the main production DB.
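A hedged sketch of that archive-then-purge step, assuming the transaction table is named Tbl_Transaction with a transactionDateTime column, and that an ArchiveDb database with an identically structured table already exists:

BEGIN TRANSACTION;

INSERT INTO ArchiveDb.dbo.Tbl_Transaction
SELECT *
FROM dbo.Tbl_Transaction
WHERE transactionDateTime < DATEADD(DAY, -90, GETDATE());

DELETE FROM dbo.Tbl_Transaction
WHERE transactionDateTime < DATEADD(DAY, -90, GETDATE());

COMMIT TRANSACTION;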
If it ever comes down to it, most often it is a case of "he with the most documentation wins"!
