Incrementally moving data across SQL Server instances without an ETL tool - sql-server

I am working on a project that involves regularly copying data from identically structured database A (on a SQL Server instance) to database B (on a different instance). I do not have the option of using SSIS or another ETL tool, so I have to use already established linked servers and write a stored procedure run as an hourly job for the incremental loads.
To copy the delta records from database A to B, I have the following stored procedure to be run as a job:
CREATE PROCEDURE Incremental_Load
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO New_Table (New_Date_Column, Copied_Attribute)
    SELECT Datetime_Var, Attribute1
    FROM Linked_Server.Database_A.dbo.Source_Table
    WHERE Datetime_Var > (SELECT MAX(New_Date_Column) FROM New_Table);
END
It seems to work: only the new records that are not yet present in the new table are inserted, based on the datetime of the last record. My question has to do with performance and potential issues. If the source table gets new records every 30 minutes, and this stored procedure is run hourly, should I expect any issues, especially with regard to the subquery? Also, is it possible that I will not detect all deltas? Are there any other glaring issues with this approach?
In case it matters, I do not have access to SQL Server Profiler. I also cannot create jobs because of my privileges/roles; someone else must do that for me. So I would like to collect some information in advance of testing.
Thank you.
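One case the procedure above does not cover is an empty target table: MAX(New_Date_Column) is then NULL, the comparison Datetime_Var > NULL is never true, and nothing is copied. A minimal sketch of a variant that reads the watermark into a local variable first. The procedure name Incremental_Load_Watermark, the DATETIME type and the fallback date are assumptions for illustration, not part of the original:

```sql
CREATE PROCEDURE Incremental_Load_Watermark  -- hypothetical name
AS
BEGIN
    SET NOCOUNT ON;

    -- Assumes New_Date_Column is DATETIME; match the variable type to the column.
    DECLARE @Last_Loaded DATETIME;

    -- Highest timestamp already copied; NULL when New_Table is empty.
    SELECT @Last_Loaded = MAX(New_Date_Column) FROM New_Table;

    -- Assumed fallback so that the first run copies every source row.
    SET @Last_Loaded = ISNULL(@Last_Loaded, '19000101');

    INSERT INTO New_Table (New_Date_Column, Copied_Attribute)
    SELECT Datetime_Var, Attribute1
    FROM Linked_Server.Database_A.dbo.Source_Table
    WHERE Datetime_Var > @Last_Loaded;
END
```

A strict > comparison on a timestamp can still miss rows that are committed late with an older Datetime_Var, or rows that share the exact watermark value but arrive after a run. Whether that can happen depends on how the source assigns Datetime_Var.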

Related

SQL Server: Archiving old data

I have a database that is getting pretty big, but the client is only interested in the last 2 years' data. But they would like to keep the older data "just-in-case".
Now we would like to archive the data to a different server over a WAN.
My plan is to create a stored proc to:
1) Copy all data from lookup tables, tables containing master data and foreign key tables over to the archive server.
2) Copy data from transactional tables over to the archive DB.
3) Delete transactional data from master db that's older than 2 years.
Although the approach will theoretically meet our needs, the 2 main problems are:
1) Performance: I'm copying the data over via SQL linked servers. Some of the big tables are really slow, because the procedure needs to work out which records already exist and update them, and create the records that don't exist yet. It looks like it will run for 3-4 hours.
2) Sequencing: we need to copy the tables in the correct order to prevent foreign key violations. Tables that have a relationship to themselves (e.g. a Customers table with a ParentCustomer field) need to be transferred without the ParentCustomer value, which is then updated afterwards to prevent FK violations. This makes it difficult to auto-generate my INSERT and UPDATE statements (I would like to auto-generate my statements as far as possible).
I just feel there might be a better way of archiving data that I do not yet know about. SSIS might be an option, but not sure if it will prevent my existing challenges. I don't know much about SSIS, so I might need to find some material to study it if that's the way to go.
I believe you need a batch process that will run as a scheduled task; perhaps every night. There are two options, which you have already discussed:
1) SQL Agent Job, which executes a Stored Procedure. The stored procedure will use Linked Servers.
2) SQL Agent Job, which will execute an SSIS package.
I believe you could benefit from a combination of both approaches, which would avoid linked servers. Here are the steps:
1) An SQL Agent Job executes an SSIS package, which transfers the data to be archived from the live database to the copy database. This should be done in a specific sequence to avoid foreign key violations.
2) Once the SSIS package has executed the transfer, then it executes a stored procedure on the live database deleting the information that is over two years old. The stored procedure will not require any linked servers.
You will have to use transactions to make sure duplicate data is not archived. For example, if the SSIS package fails then the transaction should be rolled back and the Stored Procedure should not be executed.
You can use table partitions to create separate partitions for relevant date ranges.
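The delete step described above can be sketched as a batched loop, so that no single transaction holds locks on the live table for long. This is only a sketch: dbo.Transactions, TransactionDate and the batch size of 5000 are hypothetical, and it assumes the rows have already been copied to the archive and verified there.

```sql
-- Hypothetical table and column names: dbo.Transactions, TransactionDate.
DECLARE @Cutoff DATETIME;
DECLARE @Rows INT;

SET @Cutoff = DATEADD(YEAR, -2, GETDATE());  -- keep the last 2 years
SET @Rows = 1;

WHILE @Rows > 0
BEGIN
    -- Delete in small batches; the batch size is an assumed starting point.
    DELETE TOP (5000) FROM dbo.Transactions
    WHERE TransactionDate < @Cutoff;

    -- Stop once a pass deletes nothing.
    SET @Rows = @@ROWCOUNT;
END
```

DELETE TOP (n) requires SQL Server 2005 or later. Child tables would have to be purged before their parents, in the reverse of the copy order.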

Will recreating indexes improve performance?

I have a few tables (base tables) which are inserted into and updated twice a week. The indexes on these tables were created a long time ago.
I apply logic on top of these tables in a stored procedure (without any parameters) and create a final output table.
I schedule this stored procedure twice a week using a SQL Server Agent job.
It is running slowly now (50 minutes), whereas if I run the stored procedure manually it runs faster (15 - 18 minutes).
Do I have to drop the indexes whenever insert or update is happening in base tables and recreate it again after the insert or update?
If so do I have to do it every week?
What is its effect in performance of SQL Server agent jobs?
Indexes do require maintenance, but how often depends entirely on how much data is changed and how those changes are ordered. You can google around for any number of scripts to check your index fragmentation and to defragment indexes. Usually, even for larger databases, weekly or nightly maintenance is more than enough.
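As a rough sketch of such a fragmentation check (SQL Server 2005 and later), the query below lists the indexes in the current database by fragmentation. The page_count cutoff of 1000 is an assumption used to skip indexes too small to matter:

```sql
SELECT
    OBJECT_NAME(ips.object_id)        AS table_name,
    i.name                            AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE ips.index_id > 0           -- skip heaps
  AND ips.page_count > 1000      -- assumed cutoff: ignore very small indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

An index that shows up as heavily fragmented can then be handled with ALTER INDEX ... REORGANIZE or ALTER INDEX ... REBUILD, so there is no need to drop and recreate it by hand.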
Anyway, the fact that the execution time differs depending on how you run it points to two possible causes: parametrization, or the SET properties used by the connection.
If your procedure takes parameters and you run the script manually with the values filled in, SQL Server knows exactly which values you're using and can optimize the query execution to use the correct indexes on the spot. If your agent calls the procedure with parameters, the process is different. SQL Server may not know which values are being used, so it may fall back to a more generic plan or, worse yet, full table scans (reading all the data in the whole table, rendering the indexes useless) to make sure that it finds all the relevant data for the query. Google SQL Server parametrization and parameter sniffing, and you can find out more.
The SET properties, on the other hand, control specific session properties that are applied automatically when you connect directly to the database via Management Studio. When you use an agent job, that may not be the case. This can also result in a different plan which takes far more time.
Both of these cases depend on your database settings and the way your procedure works, so we have to guess here.
But typically, you need to set the following properties in the beginning of a script in an agent job to match the session properties used in your regular Management Studio session:
SET ANSI_NULLS ON;
GO
SET QUOTED_IDENTIFIER ON;
GO
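To see which SET options actually differ between the two sessions, rather than guessing, something like the following can be run while both the Management Studio session and the agent job are connected. It is a sketch that assumes SQL Server 2005 or later and the VIEW SERVER STATE permission, which is needed to see sessions other than your own:

```sql
-- One row per user session, with the SET options currently in effect.
SELECT
    session_id,
    program_name,              -- distinguishes Management Studio from agent sessions
    ansi_nulls,
    quoted_identifier,
    arithabort,
    ansi_padding,
    ansi_warnings,
    concat_null_yields_null
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
ORDER BY program_name;
```

Any column whose value differs between the two rows is a candidate for an explicit SET statement at the start of the job step.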
All of the terms here can be googled. I suggest you do so. Those articles can explain these things far better than I've the time for, especially given that - no disrespect intended - you're relatively new to SQL Server. So explaining these things with a suitable terminology here, is difficult. :)

T-SQL: advice on copying data across to another database

I need advice on copying daily data to another server.
To give you a picture of the situation: workstations post transactions to two database servers (DB1 and DB2). These DB servers are hosted on two separate physical servers and are linked. There are about 50,000 transactions per day for now, but this will increase soon. On some days a workstation may be down (operational but unable to post data), and its transactions are then posted a few days later.
So I run a query across those two linked servers. The daily query output contains ~50,000 records and takes at least 15 minutes to fetch, because the linked servers have performance problems. I will create a stored procedure and schedule it to run at 2 AM.
My concern starts here: the output will be copied across to another data warehouse (DW). This is our client's territory, which I do not know much about. The DW will be linked to these DB servers so that the data produced by my stored procedure can be sent across.
Now, what would you do to copy the data across:
1) Create a staging table on DB1 and copy the stored procedure output into it on the same server, so that the data is available and we do not need to rerun the stored procedure. The client then retrieves it later.
2) Use a "select into" statement to copy the content to a remote DW table. I do not know what happens with this one while the data is being fetched and sent across to the DW. Remember that it takes ~15 minutes for my stored procedure to fetch the data.
3) Post the data (retrieved by the stored procedure) as an XML file through FTP.
Please tell me if there is a way of setting an alert or notification on jobs.
I just want to take precautions so it will be easier to track when something goes wrong.
Any advice is appreciated very much. Thank you. Oz.
When it comes to copying data in SQL Server, you need to look at the High Availability solutions; depending on the version and edition of your SQL Server, you will have different options.
http://msdn.microsoft.com/en-us/library/ms190202(v=sql.105).aspx
If you just need to move data for specific tables, you have options like an SSIS job or SQL Server Replication.
If you want all tables in a given database copied to another server, you should use Log Shipping, which copies the entire content of the source database to another location. Because this is done at smaller intervals, your load is distributed over a longer period of time instead of running as one large transaction.
Another great alternative is SQL Server Replication. This option will capture transaction on the source and push them to the target. This model requires publisher (source), distributor (can be source or another db) and subscriber (target).
You can also create an SSIS job that runs frequently and moves only a specified amount of data each time.
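On the question about alerts or notifications for jobs: SQL Server Agent can email an operator when a job fails. A minimal sketch, assuming Database Mail is already configured and enabled for SQL Server Agent. The operator name, email address and job name below are hypothetical:

```sql
USE msdb;
GO
-- Hypothetical operator name and address.
EXEC dbo.sp_add_operator
    @name          = N'DW_Load_Operator',
    @enabled       = 1,
    @email_address = N'dba-team@example.com';
GO
-- Hypothetical job name; @notify_level_email = 2 means "notify when the job fails".
EXEC dbo.sp_update_job
    @job_name                   = N'Nightly_DW_Export',
    @notify_level_email         = 2,
    @notify_email_operator_name = N'DW_Load_Operator';
GO
```

The same settings are available in Management Studio on the Notifications page of the job properties.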

Recording all Sql Server Inserts and Updates

How can I record all the Inserts and Updates being performed on a database (MS SQL Server 2005 and above)?
Basically I want a table in which I can record all the inserts and updates issued on my database.
Triggers will be tough to manage because there are 100s of tables and growing.
Thanks
Bullish
We have hundreds of tables and growing, and we use triggers. In newer versions of SQL Server you can use Change Data Capture or Change Tracking, but we have not found them adequate for auditing.
What we have are two separate audit tables for each table: one for recording the details of the instance (1 row even if you updated a million records) and one for recording the actual old and new values. Each has the same structure and is created by running a dynamic SQL proc that looks for unaudited tables and creates the audit triggers. This proc is run every time we deploy.
Then you should also take the time to write a proc to pull the data back out of the audit tables if you want to restore the old values. This can be tricky to write on the fly with this structure, so it is best to have it handy before you have the CEO peering down your neck while you restore the 50,000 users accidentally deleted.
As of SQL Server 2008 and above you have change data capture.
Triggers, although unwieldy and a maintenance nightmare, will do the job on versions prior to 2008.
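For reference, enabling change data capture looks roughly like this. It is a sketch with hypothetical database and table names (MyDatabase, dbo.Orders); it assumes SQL Server 2008 or later, an edition that supports the feature, and a running SQL Server Agent, which hosts the capture job:

```sql
USE MyDatabase;  -- hypothetical database name
GO
-- Enable CDC for the database (requires sysadmin).
EXEC sys.sp_cdc_enable_db;
GO
-- Enable CDC for one table; this has to be repeated for each table to track.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',   -- hypothetical table name
    @role_name     = NULL;        -- NULL: no gating role for reading changes
GO
-- Read every captured insert, update and delete for that table.
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Orders');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```

Changes land in one change table per tracked table rather than in a single log table, and the captured rows do not record which user made the change.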

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries, and to know which database each row came from, we will add a single column DatabaseID to each target table, describing where the row came from.
Editing the source tables is not an option, as they belong to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then to run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program that opens a connection to the SQL Server where we would like the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each table to say which DatabaseID the row came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to:
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
Nothing fancy but couldn't you do something like
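-- Drop the previous copy if it exists, then rebuild it with SELECT ... INTO.
IF OBJECT_ID('dbo.Merged') IS NOT NULL DROP TABLE dbo.Merged
-- Linked-server tables need four-part names; the database part is a placeholder.
SELECT [DatabaseID] = 'Database1', * INTO dbo.Merged FROM ServerA.Database1.dbo.[Table]
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.Database2.dbo.[Table]
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.DatabaseX.dbo.[Table]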
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
Reliable logging
We had a similar requirement, where we took a different approach. First we created a central database to collect the data. Then we created an inventory table to store the list of target servers / databases. Then we wrote a small VB.NET based CLR procedure which takes the path of a SQL query, the target SQL instance name and the target table which will store the data (this eliminates the need to set up a linked server when new targets are added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate onto it.
The above CLR procedure is wrapped in another procedure which dequeues the message, executes the SQL on the target server provided. The wrapper procedure is then configured as the activated procedure for the queue.
With this we are able to achieve a bit of parallelism to capture the data.
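The queue side of such a setup can be sketched as follows. All object names are hypothetical, the activated procedure must already exist when the queue is created, and the message type, contract and service definitions are left out. MAX_QUEUE_READERS is what allows several copies of the wrapper procedure to run at once; the value 4 is an assumption:

```sql
-- Hypothetical names: dbo.CollectionQueue, dbo.usp_ProcessCollectionMessage.
CREATE QUEUE dbo.CollectionQueue
    WITH STATUS = ON,
    ACTIVATION (
        STATUS = ON,
        PROCEDURE_NAME = dbo.usp_ProcessCollectionMessage,  -- wrapper procedure
        MAX_QUEUE_READERS = 4,   -- assumed degree of parallelism
        EXECUTE AS OWNER
    );
GO
-- Pause collection, e.g. for maintenance on the central server.
ALTER QUEUE dbo.CollectionQueue WITH STATUS = OFF;
GO
-- Resume collection.
ALTER QUEUE dbo.CollectionQueue WITH STATUS = ON;
GO
```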
Advantages:
Easy to set up.
Easy to manage (add / remove targets).
Same framework works for multiple queries.
Logging tables to check for failed queries.
Works independently for each target, so if one of the targets fails to respond, the others still continue.
The workflow can be paused gracefully by disabling the queue (for maintenance on the central server), and collection can be resumed by re-enabling it.
Disadvantages:
Requires a good understanding of Service Broker.
Poison messages must be handled properly.
Please let me know if it helps.

Resources