Control Behavior of Constraints During SQL Bulk Insert? - sql-server

I am curious if there is a way to control how the database responds to foreign and primary key constraints while using the bulk insert command. Specifically, I am inserting data into an employees table (with primary key empID) that is linked to data in other tables (children, spouse, and so on) by that key (empID). Given the logic and purpose behind this application, if, during bulk insert, a duplicate primary key is found, I would like to do the following:
Delete (or update) the 'employee' row that already exists in the database.
Delete all data from other tables associated with that employee (that is children, beneficiaries, and so on).
Insert the new employee data.
If I were inserting data the usual way, that is without bulk insert, I think the task would be quite simple. However, since I am using bulk insert, I must admit I am not quite certain how to approach this.
I am certainly not expecting anybody to write my code for me, but I am not quite sure if it is even possible, how to begin, or what the best approach might be. A stored procedure perhaps, or changes to schema?
Thanks so much for any guidance.

The usual solution to this kind of problem is to load the data into a Staging table, perhaps in its own Schema or even Database. You would then load the actual table from this staging table, allowing you to perform whatever logic is required in an unrestricted manner. This has the added benefit of letting you log/audit/check the logic you are using while loading the 'real' table.
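A minimal sketch of that pattern, assuming hypothetical dbo.EmployeeStaging, dbo.Employees, dbo.Children and dbo.Beneficiaries tables keyed on empID (the file path, format options and non-key columns are placeholders):

    -- Load the raw file into the staging table first.
    BULK INSERT dbo.EmployeeStaging
    FROM 'C:\data\employees.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

    BEGIN TRANSACTION;

    -- Remove child data for every employee that is about to be replaced.
    DELETE c
    FROM dbo.Children AS c
    JOIN dbo.EmployeeStaging AS s ON s.empID = c.empID;

    DELETE b
    FROM dbo.Beneficiaries AS b
    JOIN dbo.EmployeeStaging AS s ON s.empID = b.empID;

    -- Remove the existing employee rows themselves.
    DELETE e
    FROM dbo.Employees AS e
    JOIN dbo.EmployeeStaging AS s ON s.empID = e.empID;

    -- Insert the fresh employee data.
    INSERT INTO dbo.Employees (empID, FirstName, LastName)
    SELECT empID, FirstName, LastName
    FROM dbo.EmployeeStaging;

    COMMIT TRANSACTION;

A MERGE from the staging table is another option if you would rather update matching rows in place than delete and re-insert them.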

Related

Historical, versioned and immutable data

We need to add data auditing to a project.
We could create some kind of Log or Audit table to record the changes in our SQL database. But would it not be a better idea to make the data in the database immutable? So, instead of updating existing values, we would add a new time-stamped row. This way ALL changes are tracked.
We are using the repository pattern, so this can provide a means to completely abstract this immutability/history/versioning away from client code. Our repositories consist of the basic CRUD operations (add, update, delete, find/gets). The following changes would need to occur:
Add: Insert with a new Identity and set the Timestamp.
Update: Insert with the old Identity value and set the Timestamp.
Delete: Insert with the old Identity, set the IsDeleted flag to true and set the Timestamp.
Find/Gets: Only return rows with the latest Timestamp values and where IsDeleted is false.
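Roughly, the Find/Get would then have to pick out the newest version of each row - something like this (hypothetical Customer table, with CustomerId as the logical identity and VersionTimestamp/IsDeleted as the versioning columns):

    -- Return only the latest, non-deleted version of each logical customer.
    SELECT c.*
    FROM Customer AS c
    JOIN (
        SELECT CustomerId, MAX(VersionTimestamp) AS LatestVersion
        FROM Customer
        GROUP BY CustomerId
    ) AS latest
        ON latest.CustomerId = c.CustomerId
       AND latest.LatestVersion = c.VersionTimestamp
    WHERE c.IsDeleted = 0;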
Other approaches:
Read from this post: rather use two timestamp values, a start and an end date.
Instead of a timestamp, rather use some kind of IsLatest flag
My only gripe with the above is that, if the data had somehow gone bad, multiple rows could be returned for a given date and time.
Is there any major flaw in this design, or is there something I could have done differently? Is there perhaps a formalized approach to the above?
Is this somehow related to event sourcing?
My take on this:
You will lose the ability to create unique constraints on the data, except on the identity columns.
It also complicates FK handling. What happens when you update a parent row? The update is really an insert with a new identity, so the child rows still reference the "old" record.
Performance will suffer.
I would advise creating a separate table for the archive. You can simplify the updates by using the OUTPUT clause with UPDATE, inserting into the archive table in the same statement.
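A sketch of that pattern, with hypothetical Customer/CustomerArchive tables and columns:

    -- Update the live row and copy its previous values into the archive in one statement.
    UPDATE dbo.Customer
    SET Name = @NewName,
        ModifiedAt = SYSUTCDATETIME()
    OUTPUT deleted.CustomerId, deleted.Name, deleted.ModifiedAt
        INTO dbo.CustomerArchive (CustomerId, Name, ModifiedAt)
    WHERE CustomerId = @CustomerId;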
The approach you're describing is more appropriate for a DWH than an OLTP database.

PostgreSQL - table with foreign key column needs to be updated first

I have encountered a problem in my project using PostgreSQL.
Say there are two tables A and B, both A and B have a (unique) field named ID. The ID column of table A is declared as a primary key, while the ID column of table B is declared as a foreign key pointing back to table A.
My problem is that every time we have new data inputted into the database, the values in table B tend to be updated prior to the ones in table A (this problem cannot be avoided, as the project is designed this way). So I have to modify the relationship between A and B.
My goal is to achieve a situation where I can insert data into A and B separately while having the ON DELETE CASCADE clause enabled. What's more, INSERT and DELETE queries may happen at the same time.
Any suggestions?
It sounds like you have a badly designed project, if you can't use deferred constraints. Your basic problem is that you can't guarantee internal consistency of the data because transactions may occur which do not move the data from one consistent state to another.
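If deferred constraints do turn out to be an option, this is roughly what they look like in PostgreSQL (the constraint name and the payload column are illustrative); the child rows can then be inserted before the parent within the same transaction, and the constraint is only checked at commit:

    ALTER TABLE b
        ADD CONSTRAINT b_id_fkey
        FOREIGN KEY (id) REFERENCES a (id)
        ON DELETE CASCADE
        DEFERRABLE INITIALLY DEFERRED;

    BEGIN;
    INSERT INTO b (id, payload) VALUES (42, 'child arrives first');
    INSERT INTO a (id) VALUES (42);  -- parent arrives later in the same transaction
    COMMIT;                          -- the foreign key is checked here, so this succeeds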
Here is what I would do to be honest:
Catalog affected keys.
Drop affected key constraints.
Write a periodic job that looks for orphaned rows. Use LEFT JOIN because antijoins do not perform as well in PostgreSQL.
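The orphan check would be something along these lines, using the A/B tables from the question:

    -- Find rows in B whose parent row in A has gone missing.
    SELECT b.id
    FROM b
    LEFT JOIN a ON a.id = b.id
    WHERE a.id IS NULL;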
The problem with a third table is that it doesn't solve your basic problem, which is that writes are not atomically consistent. And once you sacrifice that, a lot of your transactional controls go out the window.
Long term, the project needs to be rewritten.

Quickly update a large amount of rows in a table, without blocking inserts on referencing tables, in SQL Server 2008

Context:
I have a system that acts as a Web UI for a legacy accounting system. This legacy system sends me a large text file, several times a day, so I can update a CONTRACT table in my database (the file can have new contracts, or just updated values for existing contracts). This table currently has around 2M rows and about 150 columns. I can't have downtime during these updates, since they happen during the day and there are usually about 40 logged-in users at any given time.
My system's users can't update the CONTRACT table, but they can insert records in tables that reference the CONTRACT table (Foreign Keys to the CONTRACT table's ID column).
To update my CONTRACT table I first load the text file into a staging table, using a bulk insert, and then I use a MERGE statement to create or update the rows, in batches of 100k records. And here's my problem - during the MERGE statement, because I'm using READ COMMITTED SNAPSHOT isolation, the users can keep viewing the data, but they can't insert anything - the transactions will time out because the CONTRACT table is locked.
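Simplified, the batching loop looks something like this (the ContractStaging table, its Processed flag and the two columns shown are placeholders for the real ~150 columns):

    DECLARE @BatchSize int = 100000;
    DECLARE @Claimed TABLE (ID int PRIMARY KEY);

    WHILE 1 = 1
    BEGIN
        DELETE FROM @Claimed;  -- reset the chunk list for this iteration

        BEGIN TRANSACTION;

        -- Claim the next chunk of unprocessed staging rows and remember their keys.
        UPDATE TOP (@BatchSize) dbo.ContractStaging
        SET Processed = 1
        OUTPUT inserted.ID INTO @Claimed (ID)
        WHERE Processed = 0;

        IF @@ROWCOUNT = 0
        BEGIN
            ROLLBACK TRANSACTION;
            BREAK;
        END;

        -- Merge only the claimed chunk, keeping each transaction short.
        MERGE dbo.CONTRACT AS target
        USING (SELECT s.ID, s.Amount
               FROM dbo.ContractStaging AS s
               JOIN @Claimed AS c ON c.ID = s.ID) AS source
            ON target.ID = source.ID
        WHEN MATCHED THEN
            UPDATE SET target.Amount = source.Amount
        WHEN NOT MATCHED THEN
            INSERT (ID, Amount) VALUES (source.ID, source.Amount);

        COMMIT TRANSACTION;
    END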
Question: does anyone know of a way to quickly update this large amount of rows, while enforcing data integrity and without blocking inserts on referencing tables?
I've thought about a few workarounds, but I'm hoping there's a better way:
Drop the foreign keys. - I'd like to enforce my data consistency, so this doesn't sound like a good solution.
Decrease the batch size on the MERGE statement so that the transaction is fast enough not to cause timeouts on other transactions. - I have tried this, but the sync process becomes too slow; as I mentioned above, I receive the update files frequently and it's vital that the updated data is available shortly after.
Create an intermediate table, with a single CONTRACTID column and have other tables reference that table, instead of the CONTRACT table. This would allow me to update it much faster while keeping a decent integrity. - I guess it would work, but it sounds convoluted.
Update:
I ended up dropping my foreign keys. Since the system has been in production for some time and the logs don't ever show foreign key constraint violations, I'm pretty sure no inconsistent data will be created. Thanks to everyone who commented.

Database design

Our database is part of a (specialized) desktop application.
The primary goal is to keep data about certain events.
Events happen every few minutes.
The data collected about events changes frequently with new data groups being added in and old ones swapped out almost monthly (the data comes in definite groups).
I have to put together a database to track the events. A first stab at that might be to simply have a single big table where each row is an event - that is basically what our data looks like - but this seems undesirable because of our constantly changing groups of data (i.e. the number of columns would either keep growing perpetually, or we would constantly have this month's database incompatible with last month's database - ugh!). Because of this I am leaning toward the following, even though it creates circular references. (But maybe this is a stupid idea.)
Create tables like
Table Events
Table Group of the Month 1
Table Group of the Month 2
...
Table Events has:
A primary key whose deletion cascades to delete rows with foreign keys referencing it
A nullable foreign key for each data group table
Each data group table has:
A primary key, whose deletion cascades to null out foreign keys referencing it
Columns for the data in that group
A non-nullable foreign key back to the event
This still leaves you with a growing, changing Events table (as you need to add new foreign key columns for each new data group), just much less drastically. However, it seems more modular to me than one giant table. Is this a good solution to this situation? If not, what is?
Any suggestions?
P.S. We are using SQL Express or SQL Compact (we are currently experimenting with which one suits us best)
Why not use basically the single table approach and store the changing event data as XML in an XML column? You can even use XSD schemas to account for the changing data types, and you can add indexes on XML data if fast query performance on some XML data is required.
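A minimal sketch of that approach (table, column and XPath names are illustrative):

    CREATE TABLE dbo.Events
    (
        EventId    int IDENTITY(1,1) PRIMARY KEY,
        OccurredOn datetime NOT NULL,
        EventData  xml NULL  -- the changing per-group data lives here
    );

    -- A primary XML index can speed up queries into the XML payload.
    CREATE PRIMARY XML INDEX IX_Events_EventData ON dbo.Events (EventData);

    -- Pulling a value out of the payload.
    SELECT EventId,
           EventData.value('(/Event/Duration)[1]', 'int') AS Duration
    FROM dbo.Events
    WHERE OccurredOn >= '20100101';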
A permanently changing DB schema wouldn't really be an option if I were to implement such a database.
You should not have foreign keys for each data group table in the event table.
The event table already has an event_id which is in each data group table. So you can get from event to the child tables. Furthermore there will be old rows in the event table that are not in the latest data group table. So you really can't have a foreign key.
That said, I would wonder whether there is additional structure in the data group tables that can be used to clean up your design. Without knowing anything about what they look like I can't say. But if there is, consider taking advantage of it! (A schema that changes every month is a pretty bad code smell.)
Store your data at as granular a level as possible. It might be as simple as:
EventSource int FK
EventType int FK
Duration int
OccuredOn datetime
Get the data right and as simple as possible in the first place, and then
Aggregate via views or queries. Your instincts are correct about the ever changing nature of the columns - better to control that in T-SQL than in DDL.
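For instance, something along these lines (a sketch only; the EventSource and EventType lookup tables are assumed to exist):

    CREATE TABLE dbo.Event
    (
        EventId     int IDENTITY(1,1) PRIMARY KEY,
        EventSource int NOT NULL REFERENCES dbo.EventSource (EventSourceId),
        EventType   int NOT NULL REFERENCES dbo.EventType (EventTypeId),
        Duration    int NOT NULL,
        OccuredOn   datetime NOT NULL
    );
    GO

    -- Aggregate in T-SQL rather than reshaping the schema every month.
    CREATE VIEW dbo.MonthlyEventSummary
    AS
    SELECT EventType,
           DATEADD(month, DATEDIFF(month, 0, OccuredOn), 0) AS MonthStart,
           COUNT(*)      AS EventCount,
           AVG(Duration) AS AvgDuration
    FROM dbo.Event
    GROUP BY EventType,
             DATEADD(month, DATEDIFF(month, 0, OccuredOn), 0);
    GO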
I faced this problem a number of years ago with logfiles for massive armies of media players, and where I ultimately ended up was taking this data and creating an OLAP cube out of it. OLAP is another approach to database design where the important thing is optimizing it for reporting and "sliceability". It sounds like you're on that track, where it would be very useful to be able to look at a quick month's view of data, then a quarter's, and then back down to a week's. This is what OLAP is for.
Microsoft's technology for this is Analysis Services, which comes as part of Sql Server. If you didn't want to take the entire plunge (OLAP has a pretty steep learning curve), you could also look at doing a selectively denormalized database that you populated each night with ETL from your source database.
HTH.

Deleting Database Rows and their References - Best Practices

How do I go about deleting a row that is referenced by many other tables, either as a primary key or as a foreign key?
Do I need to delete each reference in the appropriate order, or is there an 'auto' way to perform this in, for example, LINQ to SQL?
If you're performing all of your data access through stored procedures then your delete stored procedure for the master should take care of this. You need to maintain it when you add a new related table, but IMO that requires you to think about what you're doing, which is a good thing.
Personally, I stay away from cascading deletes. It's too easy to accidentally delete a slew of records when the user should have been warned about existing children instead.
Many times the best way to delete something in a database is to just "virtually" delete it by setting an IsDeleted column, and then ignoring the row in all other queries.
Deletes can be very expensive for heavily linked tables, and the locks can cause other queries to fail while the delete is happening.
You can just leave the "IsDeleted" rows in the system forever (which might be helpful for auditing), or go back and delete them for real when the system is idle.
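A sketch of the soft-delete pattern (table and column names are illustrative):

    -- "Delete" by flagging the row instead of removing it.
    UPDATE dbo.Orders
    SET IsDeleted = 1
    WHERE OrderId = @OrderId;

    -- Every other query simply ignores flagged rows.
    SELECT OrderId, CustomerId, OrderDate
    FROM dbo.Orders
    WHERE IsDeleted = 0;

    -- Optionally purge the flagged rows for real when the system is idle.
    DELETE FROM dbo.Orders
    WHERE IsDeleted = 1;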
If you have the foreign keys set up with ON DELETE CASCADE, it'll take care of pruning your database with just DELETE master WHERE id = :x.
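For example, with hypothetical Master and Child tables:

    -- Child rows are removed automatically when their master row is deleted.
    ALTER TABLE dbo.Child
    ADD CONSTRAINT FK_Child_Master
        FOREIGN KEY (MasterId) REFERENCES dbo.Master (Id)
        ON DELETE CASCADE;

    DELETE FROM dbo.Master WHERE Id = @Id;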