We are going to build a history mechanism for changes in our DB (DART in the picture) via triggers.
We have 600 tables.
Whenever a record is changed, the trigger will insert the old (deleted) version into XXX.
Regarding the XXX:
Option 1: clone each table in the "Dart" DB, so each table has a "sister" history table.
e.g.:
Table1 will have Table1_History
Problems:
we will have 1200 tables
programmers can make mistakes by working on the wrong tables...
Option 2: create a new DB (DART_2005 in the picture) and put the history tables there.
Option 3: use a linked server that hosts the DB containing the history tables.
Questions:
1) Which option gives the best performance? (I guess 3 does not - but is it 1 or 2, or are they the same?)
2) Does option 2 behave like a linked server (in queries we will need to select from both DBs...)?
3) What is the best-practice approach?
All three approaches are viable and have similar performance depending on your network speed, but each one will cause you a lot of headaches on a system with many concurrent users.
Since you will be inserting/updating multiple tables in one transaction with very different access patterns (the source table is random, the history table is sequential), you will end up with blocking and/or deadlocks.
If the existing table schema cannot be changed
If you want a history system driven by your database, ideally you will queue the history writes to prevent blocking problems:
1) Fire a trigger on update of your table.
2) The trigger submits a message containing the information from the inserted/deleted tables to a SQL Server Service Broker queue (a hedged sketch is shown below).
3) An activation stored procedure pulls the information from the queue and writes it to the appropriate history table.
4) On failure, a new message is sent to an "error queue" from which a retry mechanism can re-submit to the original queue (make sure to include a retry counter in the message).
This way your history updates are non-blocking and cannot get lost.
Note: when working with SQL Server Service Broker, make sure you completely understand the "poison message" concept.
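As a hedged illustration of steps 1) and 2) above - not a drop-in implementation - a trigger could serialize the old row versions and hand them to Service Broker roughly like this. The message type, contract, queue, and service names (HistoryMessage, HistoryContract, HistoryInitiatorService, HistoryTargetService) and the table name are assumptions, and the corresponding broker objects must already exist:

CREATE TRIGGER trg_Table1_QueueHistory
ON dbo.Table1
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Serialize the old row versions; the activation procedure will write them
    -- to the history table asynchronously.
    DECLARE @payload XML =
        (SELECT * FROM deleted FOR XML PATH('row'), ROOT('Table1_deleted'), TYPE);

    IF @payload IS NULL RETURN;  -- nothing to record

    DECLARE @handle UNIQUEIDENTIFIER;

    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [HistoryInitiatorService]
        TO SERVICE 'HistoryTargetService'
        ON CONTRACT [HistoryContract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle
        MESSAGE TYPE [HistoryMessage] (@payload);
END;

The activation procedure on the target queue then RECEIVEs these messages and performs the sequential history inserts outside the user's transaction.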
If the existing table schema can be changed
When this is an option, I recommend a "record versioning" system where every update creates a new record and your application queries the most recent version of the data. To ensure good performance the table can be partitioned to keep the most recent version of the data in one partition and the older versions in an archive partition. (I usually have an end_date or expiration_date field which is set to 9999/12/31 for the currently valid record.)
This approach of course requires considerable code changes in your data model and the existing application, which might not be very cost-effective.
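A minimal sketch of that versioning idea, with illustrative table and column names (only the end_date = 9999/12/31 convention comes from the answer above):

CREATE TABLE dbo.Customer
(
    CustomerId INT           NOT NULL,
    Version    INT           NOT NULL,
    Name       NVARCHAR(100) NOT NULL,
    start_date DATETIME      NOT NULL,
    end_date   DATETIME      NOT NULL CONSTRAINT DF_Customer_end_date DEFAULT ('99991231'),
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerId, Version)
);

-- The application always reads the currently valid version;
-- an update closes the old row (sets end_date) and inserts a new version.
SELECT CustomerId, Name
FROM dbo.Customer
WHERE end_date = '99991231';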
Options 1 and 2 will have similar performance; option 3 might be faster if you are currently limited by some resource on the database server (e.g. disk I/O) and you have a very fast network available to you.
Option 1 will lead to longer backup times for your DART database - this may be a concern.
In general, I believe that if your application domain needs the concept of "history", you should build it in as a first-class feature. There are several approaches to this - check out the links in question How to create a point in time architecture in MySQL.
Again, in general, I dislike the use of triggers for this kind of requirement. Your trigger either has to be very simple - in which case it's not always easy to use the data it creates in your history table - or it has to be smart, in which case your trigger does a lot of work, which may make evolving your database schema harder in future.
Related
We have a legacy system (MAS200, if you need to know) and an old VBS script which pulls data from MAS and populates two staging tables in our production SQL database. After some processing / cleanup that data goes into the actual tables.
Data flow : MAS200 --> Staging tables --> Production table
To simplify consider there's an "Order" parent table and an "Items" child table. Order can have multiple items, each item record will have an FK OrderId. So, during import first we import the Order data and create an entry in the "Order" table and then fetch "Items" entries and import them.
Existing TRIGGER-based approach -
At present we have two TRIGGERs - one on each staging table (Order & Items). Each new insert is tapped, and after processing the data a new entry is inserted into the actual production table. My only concern is that the trigger is executed for each Items row instead of one bulk insert, and it seems less manageable.
SP-based approach -
If I remove both TRIGGERs, I would import the data into the staging tables and finally execute an SP which imports the Order data and then performs a bulk insert into the Items table. Could that be more efficient / faster?
It's not really a comparison, just a different design. I'd like to know which one seems better, or whether there's a third, better approach to import from MAS into the production SQL DB.
EDIT 1: Thanks. As many have asked - the data volume is not big or very frequent. Let's say 10-12 Orders (with 20-30 Items) every hour. Also, with TRIGGERs we don't get a TRANSACTION, but two simple TRIGGERs suffice. I believe more scripting is needed with an SP.
Goal : Need to keep it as simple, clean and efficient as possible.
Using Triggers:
Pros:
The data sync is real time. Since the data is created by data entry, the volume should not be big, so a bulk insert doesn't improve things much; performance using triggers is good enough.
Cons:
Data sync is not real time, and if the connection breaks between MAS200 and production, you'll have a big problem. Also (as you mentioned) you cannot have a transaction, which is a big issue.
I suggest you use an SP to transfer the data on a time-interval basis (if you can tolerate the synchronization delay).
If you really want a fast approach, you need to:
1) disable the FK on the ITEMS table for the duration of the load
2) load the ORDERS (and then the ITEMS), and then re-enable the FK
All of this should be done in an SP; the trigger approach is safe but very slow when it comes to large bulk loads. A hedged sketch of such a procedure is shown below.
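A hedged sketch of such a procedure, assuming illustrative table, column, and constraint names (stg_Orders/stg_Items staging tables, FK_Items_Orders) rather than the poster's real schema:

CREATE PROCEDURE dbo.usp_ImportFromStaging
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- 1) disable the FK for the duration of the load
    ALTER TABLE dbo.Items NOCHECK CONSTRAINT FK_Items_Orders;

    -- 2) load parents first, then children, as set-based inserts
    INSERT INTO dbo.Orders (OrderId, OrderDate)
    SELECT s.OrderId, s.OrderDate
    FROM dbo.stg_Orders AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Orders o WHERE o.OrderId = s.OrderId);

    INSERT INTO dbo.Items (ItemId, OrderId, Sku, Quantity)
    SELECT s.ItemId, s.OrderId, s.Sku, s.Quantity
    FROM dbo.stg_Items AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Items i WHERE i.ItemId = s.ItemId);

    -- re-enable and re-validate the FK so it stays trusted
    ALTER TABLE dbo.Items WITH CHECK CHECK CONSTRAINT FK_Items_Orders;

    COMMIT TRANSACTION;
END;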
I hope you will find it useful
Thanks
Background
I have a multi-tenant scenario and a single SQL Server project that will be deployed to multiple databases on the same server. There will be one DB for each tenant, plus one "model" DB.
The "model" database serves three purposes:
Force some "system" data to be always present in each tenant database
Serve as an access point for users with a special permission to edit system data (which will be synced to all tenants at specific points)
When a new tenant is created, the database is copied and attached under a new name representing the tenant
There are triggers that check whether modified / deleted data within a tenant DB corresponds to "system" data inside the "model" DB. If it does, an error is raised saying that system data cannot be altered.
Issue
So here's a part of the trigger that checks if deletion can be allowed:
IF DB_NAME() <> 'ModelTenant' AND EXISTS
(
SELECT
[deleted].*
FROM
[deleted]
INNER JOIN [---MODEL DB NAME??? ---].[MySchema].[MyTable] [ModelTable]
ON [deleted].[Guid] = [ModelTable].[Guid]
)
BEGIN
    ;THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1;
END;
I can't figure out what should take the place of --- MODEL DB NAME??? --- in the above code so that the project compiles properly. When referring to a completely different project I know what to do: use a reference to that project, represented by an SQLCMD variable. But in this scenario the reference is essentially to the same project, only on a different database, and I can't seem to add a self-reference in this manner.
What can I do? Does SSDT offer some kind of support for such a scenario?
Have you tried setting up a Database Variable? You can read under "Reference aware statements" here. You could then say:
SELECT * FROM [$(MyModelDb)].[MySchema].[MyTable] [ModelTable]
If you don't have a specific project for $(MyModelDb) you can choose the option to "suppress errors by unresolved references...". It's been forever since I've used SSDT projects, but I think that should work.
TIP: If you need to reference one table 100 times, you may find it better to create a SYNONYM that uses the database variable, then point to the SYNONYM in your SPROCs/TRIGGERs. Why? Because that way you don't need to redeploy your SPROCs/TRIGGERs to get the variable replaced with the actual value, and that can make development smoother.
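A hedged sketch of that tip, with illustrative synonym and object names:

-- Deployed once; the SQLCMD variable is resolved at publish time.
CREATE SYNONYM [MySchema].[ModelTable_MyTable]
    FOR [$(MyModelDb)].[MySchema].[MyTable];

-- Triggers/SPROCs then reference the synonym instead of the three-part name:
-- INNER JOIN [MySchema].[ModelTable_MyTable] AS [ModelTable]
--     ON [deleted].[Guid] = [ModelTable].[Guid]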
I'm not quite sure if SSDT is particularly well-suited to projects of any decent amount of complexity. I can think of one or two ways to most likely accomplish this (especially depending on exactly how you do the publishing / deployment), but I think you would actually lose more than you gain. What I mean by that is: you could add steps to get this to work (i.e. win the battle), but you would be creating a more complex system in order to get SSDT to publish a system that is more complex (and slower) than it needs to be (i.e. lose the war).
Before worrying about SSDT, let's look at why you need/want SSDT to do this in the first place. You have system data intermixed with tenant data, and you need to validate UPDATE and DELETE operations to ensure that the system data does not get modified, and the only way to identify data that is "system" data is by matching it to a home-of-record -- ModelDB -- based on GUID PKs.
That theory on identifying which data belongs to the "system" and not to a tenant is your main problem, not SSDT. You are definitely on the right track for a multi-tenant system by having the "model" database, but using it for data validation is a poor design choice: on top of the performance degradation already incurred from using GUIDs as PKs, you are further slowing down all of these UPDATE and DELETE operations by funneling them through a single point of contention, since all tenant DBs need to check this common source.
You would be far better off to include a BIT field in each of these tables that mixes system and tenant data, denoting whether the row was "system" or not. Just look at the system catalog views within SQL Server:
sys.objects has an is_ms_shipped column
sys.assemblies went the other direction and has an is_user_defined column.
So, if you were to add an [IsSystemData] BIT NOT NULL column to these tables, your Trigger logic would become:
IF DB_NAME() <> N'ModelTenant' AND EXISTS
(
SELECT del.*
FROM [deleted] del
WHERE del.[IsSystemData] = 1
)
BEGIN
;THROW 50000, 'The DELETE operation on table MyTable cannot be performed. At least one targeted record is reserved by the system and cannot be removed.', 1;
END;
Benefits:
No more SSDT issue (at least from this part ;-)
Faster UPDATE and DELETE operations
Less contention on the shared resource (i.e. ModelDB)
Less code complexity
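For completeness, a hedged sketch of the schema change itself (the constraint name and default are illustrative):

ALTER TABLE [MySchema].[MyTable]
    ADD [IsSystemData] BIT NOT NULL
        CONSTRAINT [DF_MyTable_IsSystemData] DEFAULT (0);  -- existing rows default to tenant data; flag system rows afterwards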
As an alternative to referencing another database project, you can produce a dacpac, then reference the dacpac as a database reference in "same server, different database" mode.
We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
And in any case you will need an application server there. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving tracking information together with the entity info by sending it back to the DB? Save it to a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table: id of the record, datetime, userid (maybe IP address, browser version, etc.) - just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides that not only does s/he want to know which records were used the most, s/he also wants to know which users are using the most records, or what time of day that usage occurs, etc.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll it up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it will be impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert); a hedged sketch follows below.
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
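A hedged sketch of the log-to-another-table idea combined with the single-round-trip stored procedure; every object and column name here is illustrative, not from the original question:

CREATE TABLE dbo.RecordAccessLog
(
    LogId      INT IDENTITY(1,1) PRIMARY KEY,
    RecordId   INT      NOT NULL,
    AccessedAt DATETIME NOT NULL DEFAULT GETDATE(),
    UserName   SYSNAME  NOT NULL DEFAULT SUSER_SNAME()
);
GO
-- One round trip: log the access and return the row in the same call.
CREATE PROCEDURE dbo.usp_GetRecord
    @RecordId INT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.RecordAccessLog (RecordId) VALUES (@RecordId);
    SELECT Id, Name FROM dbo.MyTable WHERE Id = @RecordId;
END;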
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
Find more info here or here,
or search for "database auditing for SELECT statements".
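A hedged sketch of what such an audit could look like on SQL Server 2008 or later (the question mentions an old version, so this may not be available there); the audit names and file path are illustrative, and note that auditing captures the statements that touched the table rather than per-row usage counts:

USE [master];
CREATE SERVER AUDIT [RowUsageAudit]
    TO FILE (FILEPATH = N'C:\SqlAudit\');
ALTER SERVER AUDIT [RowUsageAudit] WITH (STATE = ON);
GO
USE [MyDatabase];
CREATE DATABASE AUDIT SPECIFICATION [RowUsageAuditSpec]
    FOR SERVER AUDIT [RowUsageAudit]
    ADD (SELECT ON OBJECT::[dbo].[MyTable] BY [public])
    WITH (STATE = ON);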
Use another table as a key/value pair with two columns (e.g. id_selected, times) for storing the ids of the records you select from your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY
UPDATE times=times+1
The ON DUPLICATE KEY part is right off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead; a hedged sketch is shown below.
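A hedged translation of the snippet above into T-SQL (MERGE requires SQL Server 2008 or later; table and column names follow the example and are illustrative):

MERGE INTO countTable AS tgt
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
    ON tgt.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.times = tgt.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);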
Every day a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a Windows service that runs early in the AM to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application, which can affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur index maintenance on the table for each statement - which can be costly if it's tuned for selects. I believe with MERGE it's a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot use MERGE on SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen in production a copy of the current database used to perform a long-running action (reporting and aggregation of data in this instance); however, this wasn't merged back in. I don't know what merging capabilities are available in SQL replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference now is that SQL Server is working with a set, so it can handle the index maintenance accordingly - it should be faster, even with the joining. (A sketch of the update half is included after the insert example below.)
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT tmp.val1, tmp.val2, tmp.val3 FROM #Tempo AS tmp
WHERE NOT EXISTS
(
    SELECT *
    FROM MyTable t
    WHERE t.val1 = tmp.val1 AND t.val2 = tmp.val2 AND t.val3 = tmp.val3
)
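And, as a hedged sketch of the "update" half of the insert/update join mentioned in Update 2 (assuming val1 identifies the row, which is purely illustrative):

UPDATE t
SET    t.val2 = tmp.val2,
       t.val3 = tmp.val3
FROM   MyTable AS t
INNER JOIN #Tempo AS tmp
        ON t.val1 = tmp.val1;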
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory, if you want, before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
Let's say I have DatabaseA with TableA, which has these fields: Id, Name.
In another database, DatabaseB, I have TableA which has these fields: DatabaseId, Id, Name.
Is it possible to setup a replication publication that will send:
DatabaseA.dbid, DatabaseA.TableA.Id, DatabaseA.TableA.Name
to DatabaseB.TableA?
Edit:
The reason I'm asking is that I need to combine multiple databases (with identical schemas) into a single database, with as little latency as possible. Replication seemed like a good place to start (need to replicate data from one place to another), but I'm just in the brainstorming phase. I would definitely be open to suggestions on how to accomplish this without using replication.
There might be an easier way to do it, but the first thing I thought of is wrapping TableA in an indexed view on the source database and then replicating the view as a table (i.e., type = "indexed view logbased"). I don't think this would work with merge replication, though.
So, that would roughly be like:
CREATE VIEW dbo.TableA_with_dbid WITH SCHEMABINDING AS
    SELECT CAST(5 AS int) AS dbid, Id, Name FROM dbo.TableA  -- hard-code this source DB's id (illustrative value); nondeterministic functions such as DB_ID() aren't allowed in an indexed view
GO
CREATE UNIQUE CLUSTERED INDEX IX_TableA_with_dbid ON dbo.TableA_with_dbid (Id) -- or whatever your PK is
GO
EXEC sp_addarticle ...,
    @source_object = 'TableA_with_dbid',
    @destination_table = 'TableA',
    @type = 'indexed view logbased',
    ...
Big caveat: indexed views have a lot of requirements that may not be appropriate for your application. For example, certain options have to be set any time you update the base table.
(In response to the edit in your question...) This won't work for combining multiple sources into one table. AFAIK, an object in a subscribing database can only come from one published article. And you can't do an indexed view on the subscribing side since UNION is not allowed in an indexed view. (The docs don't explicitly state UNION ALL is disallowed, but it wouldn't surprise me. You might try it just in case.) But it still does answer your explicit question: the dbid would be in the replicated table.
Are you aggregating these events in one place from multiple sources? Replication only comes from one source - it's one-to-one, so the source ID doesn't seem like it would make much sense.
If you're aggregating data from multiple sources, maybe linked servers and triggers is a better choice, and if that's the case, then you could absolutely include any information about the source that you want.
If you can clarify your question to describe the purpose, it would help us find the best solution.
UPDATED FROM NEW DETAIL IN QUESTION:
Does this solution sound like it might be what you need?
Set up AFTER triggers on the source databases that send any changed rows to the central repository database, into some kind of holding table. These rows can include additional columns, like "Source" and "Change type" (for insert, delete, etc.); a hedged trigger sketch is shown after this list.
Some central process watches the table and processes new rows (or runs periodically - once/minute, maybe), incorporating them into the central database
You could adjust how frequently the check/merge process runs on the server based on your needs (even running it constantly to handle new rows as they appear, perhaps even with an AFTER trigger on that table as well).
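A hedged sketch of item 1 in the list above. The linked server name (CentralSrv), the central database and holding table (CentralRepo.dbo.TableA_Changes), and the columns are all assumptions; also note that writing to a linked server from inside a trigger typically promotes the operation to a distributed transaction, so MSDTC must be configured:

CREATE TRIGGER trg_TableA_SendToCentral
ON dbo.TableA
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO CentralSrv.CentralRepo.dbo.TableA_Changes (SourceDb, ChangeType, Id, Name)
    -- inserted/updated rows
    SELECT DB_NAME(),
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'UPDATE' ELSE 'INSERT' END,
           i.Id, i.Name
    FROM inserted AS i
    UNION ALL
    -- deleted rows (no matching rows in inserted)
    SELECT DB_NAME(), 'DELETE', d.Id, d.Name
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;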