Incremental data transfer using Azure Data Factory - sql-server

In an on-premises SQL Server database, I have a number of tables into which various sales data for a chain of stores is inserted during the day. I would like to "harvest" this data to Azure every 15 minutes or so via Data Factory and an on-premises data management gateway. Clearly, I am not interested in copying all table data every 15 minutes, but only in copying the rows that have been inserted since the last fetch.
As far as I can see, the documentation suggests using data "slices" for this purpose. However, these slices require a timestamp column (e.g. a datetime) to exist on the tables the data is fetched from.
Can I perform a "delta" fetch (i.e. only fetch the rows inserted since the last fetch) without having such a timestamp column? Could I use a sequential integer column instead? Or even have no incrementally increasing column at all?
Assume that the last slice fetched had a window from 08:15 to 08:30. Now, if the clock on the database server is a bit behind the Azure clock, rows might be added with a timestamp of 08:29 after that slice was fetched, and these rows will not be included when the next slice (08:30 to 08:45) is fetched. Is there a smart way to avoid this problem? Shifting the slice window a few minutes into the past could minimize the risk, but not totally eliminate it.

Take Azure Data Factory out of the equation. How do you arrange for transfer of deltas to a target system? I think you have a few options:
1) Add date created/changed columns to the source tables. Write parameterised queries to pick up only new or modified values. ADF supports this scenario with time slices and system variables. Regarding an identity column, you could do that with a stored procedure (as per here) and a table tracking the last ID sent.
2) Engage Change Data Capture (CDC) on the source system. This will allow you to access deltas via the CDC functions. Wrap them in a proc and call it with the system variables, similar to the above example.
3) Always transfer all data, e.g. to staging tables on the target. Use delta code such as EXCEPT and MERGE to work out which records have changed; obviously not ideal for large volumes, but this would work for small volumes.
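For the identity-column route, a minimal sketch of the "proc plus a table tracking the last ID sent" idea. All names here (dbo.Sales, SalesId, dbo.WatermarkLog) are placeholders, not anything from ADF itself:

```sql
-- Watermark table tracking the last ID shipped per source table (hypothetical).
CREATE TABLE dbo.WatermarkLog (
    TableName  sysname NOT NULL PRIMARY KEY,
    LastIdSent bigint  NOT NULL
);

-- Proc an ADF copy activity could call each slice: fetch rows above the
-- stored watermark, then advance the watermark to the highest ID returned.
CREATE PROCEDURE dbo.GetNewSales
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @LastId bigint =
        (SELECT LastIdSent FROM dbo.WatermarkLog WHERE TableName = N'Sales');
    DECLARE @MaxId bigint =
        (SELECT ISNULL(MAX(SalesId), @LastId) FROM dbo.Sales);

    -- Only the rows inserted since the last fetch.
    SELECT *
    FROM dbo.Sales
    WHERE SalesId > @LastId AND SalesId <= @MaxId;

    UPDATE dbo.WatermarkLog
    SET LastIdSent = @MaxId
    WHERE TableName = N'Sales';
END;
```

Capturing @MaxId up front before selecting means rows committed while the proc runs are simply picked up by the next slice, rather than silently skipped.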
HTH

We are planning to add this capability to ADF. It may start with a sequential integer column instead of a timestamp. Could you please let me know whether a sequential integer column would help in your scenario?

By enabling "Change Tracking" on SQL Server, you can leverage the SYS_CHANGE_VERSION column to incrementally load data from an on-premises SQL Server or Azure SQL Database via Azure Data Factory.
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-portal
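A minimal sketch of the Change Tracking approach covered in that tutorial, assuming a SalesDb database and a dbo.Sales table whose primary key is SalesId:

```sql
-- Enable change tracking at the database and table level.
ALTER DATABASE SalesDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Sales
ENABLE CHANGE_TRACKING;

-- Each run: remember the version you last synced to (stored in a watermark
-- table in practice), then ask for everything changed since that version.
DECLARE @last_sync bigint = 0;
DECLARE @current   bigint = CHANGE_TRACKING_CURRENT_VERSION();

SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, s.*
FROM CHANGETABLE(CHANGES dbo.Sales, @last_sync) AS ct
LEFT JOIN dbo.Sales AS s ON s.SalesId = ct.SalesId;
-- Afterwards, persist @current as the new @last_sync for the next run.
```

Because SYS_CHANGE_VERSION is a monotonically increasing server-side counter, this also sidesteps the clock-skew problem from the question: no wall-clock timestamps are involved.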

If using SQL Server 2016, see https://msdn.microsoft.com/en-us/library/mt631669.aspx#Enabling-system-versioning-on-a-new-table-for-data-audit. Otherwise, you can implement the same using triggers.
And use NTP to synchronize your server time.
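On SQL Server 2016, the system-versioning route from that link could look roughly like this (table and column names are illustrative):

```sql
-- System-versioned (temporal) table: SQL Server maintains the history rows.
CREATE TABLE dbo.Sales (
    SalesId   int IDENTITY PRIMARY KEY,
    Amount    money NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SalesHistory));

-- Query the table as of a given instant, regardless of later changes:
SELECT *
FROM dbo.Sales
FOR SYSTEM_TIME AS OF '2016-01-01T08:30:00';
```

On older versions, triggers writing to a hand-rolled history table approximate the same behaviour, as noted above.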

Related

SQL Server Complex Add/Update merge million rows daily

I have 7 reports which are downloaded daily at late night.
These reports can be downloaded in csv/xml. I am downloading them in csv format as it is more memory efficient.
This process runs in the background and is managed by hangfire.
After they are downloaded, I use dapper to run a stored procedure which inserts/updates data using merge statements. This stored procedure has seven table-valued parameters.
Instead of deleting, I am updating that record's IsActive column to false.
Note that 2 reports have more than 1 million records.
I am getting timeout exceptions only in Azure SQL; on SQL Server it works fine. As a workaround, I have increased the timeout to 1000 seconds for this query.
This app is running on an Azure S2 tier.
I have pondered the option of sending xml, but I have found SQL Server is slow at processing xml, which is counterproductive.
I also cannot use SqlBulkCopy as I have to update based on some conditions.
Also note that more reports will be added in the future.
Also, when a new report is added there is a large number of inserts. If a previously added report is run again, the majority are updates.
These tables currently do not have any indexes apart from the clustered integer primary key.
Each row has a unique code. This code is used to identify whether to insert/update/delete.
Can you recommend a way to increase performance?
Is your source sending the whole data set, whether updated or new? From your mention of the unique code (insert/update/delete), I assume you are only sending changes (delta). If not, that's one area to look at. Another is to consider parallelism. You would then need a different stored procedure for each table; non-dependent tables could be processed together.
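The merge-with-IsActive pattern described in the question could be sketched like this. The table name dbo.ReportData, the Amount column, and the @Report TVP shape are assumptions; since the unique Code drives matching, a unique index on Code (the tables currently only have the clustered integer PK) would also let MERGE seek rather than scan:

```sql
-- One MERGE per report, keyed on the unique code; deletes become IsActive = 0.
MERGE dbo.ReportData AS target
USING @Report AS source
    ON target.Code = source.Code
WHEN MATCHED THEN
    UPDATE SET target.Amount   = source.Amount,
               target.IsActive = 1
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Code, Amount, IsActive)
    VALUES (source.Code, source.Amount, 1)
WHEN NOT MATCHED BY SOURCE THEN
    UPDATE SET target.IsActive = 0;
```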

Configuring SQL Server to automatically take database snapshot and use that instead of actual database

I have been allocated the task of fetching data from a database. However, as per the requirement, the fetched data must not change, even though parts of it are continuously being updated in the database.
In other words, I just need to provide a data instance as of a specific point in time. So I figured I could take a snapshot of the database and use that to show data to the client, which will always be consistent in the sense that updated records in the actual database won't be reflected. What I need exactly is to take an automatic snapshot, for example every hour, and then read from that. Is this possible in SQL Server? In Oracle I did the same using RMAN, but I am lost in SQL Server.
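Yes, SQL Server database snapshots do exactly this (note they were an Enterprise-edition feature before SQL Server 2016 SP1). A sketch an hourly SQL Agent job could run; database name, logical file name and path are illustrative:

```sql
-- Drop the previous snapshot, if one exists.
IF DB_ID('SalesDb_Snap') IS NOT NULL
    DROP DATABASE SalesDb_Snap;

-- Create a point-in-time snapshot of the source database.
CREATE DATABASE SalesDb_Snap
ON (NAME = SalesDb_Data,                      -- logical name of the source data file
    FILENAME = 'C:\Snapshots\SalesDb_Snap.ss')
AS SNAPSHOT OF SalesDb;
```

Clients then read from SalesDb_Snap (e.g. SELECT * FROM SalesDb_Snap.dbo.Sales) and never see updates made to SalesDb after the snapshot was taken.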

insert data from different db server every second

The primary DB receives all the raw data every 10 minutes, but it only stores it for 1 week. I would like to keep all the raw data for 1 year in another DB, which is on a different server. How is that possible?
I have created a T-SQL query to select the required data from the primary DB. How can I keep fetching the updated data from the primary DB and inserting it into the secondary DB accordingly? The table has a Datetime column; would it be possible to insert only the new data for the latest datetime?
Notes: source data is on SQL Server 2012;
the secondary db is SQL Server 2005.
If you are on sql2008 or higher, the merge command (ms docs) may be very useful in your actual update process. Be sure you understand it.
You table containing the full year data sounds like it could be OLAP, so I refer to it that way occasionally (if you don't know what OLAP is, look it up sometime, but it does not matter to this answer)
If you are only updating 1 or 2 tables, log shipping, replication and failover may not work well for you, especially since you are not replicating the whole table, due to different retention policies if nothing else. So make sure you understand how replication etc. work before you go down that path. If these tables make up over perhaps 50% of the total database, log-shipping-style methods might still be your best option. They work well and handle downtime issues for you -- you just replicate the source database to the OLAP server and then update from the duplicate database into your OLAP database.
Doing an update like this every second is an unusual requirement. However, if you create a linked server, you should be able to insert your selected rows into a staging table on the remote server and then update from them into your OLAP table(s). If you can reliably update your OLAP table(s) on the remote server in 1 second, you have a potentially useful method. If not, you may fall behind on posting data to your OLAP tables. If you can update once a minute, you may find you are much less likely to fall behind on the update cycle (at the cost of being slightly less current at all times).
You want to consider putting AFTER triggers on the source table(s) that copy the changes into staging table(s) (still on the source database), with an identity column on each staging table along with a flag to indicate Insert, Update or Delete. Then you are well positioned to ship updates for one or a few tables instead of the whole database. You don't need to requery your source database repeatedly to determine what data needs to be transmitted; just select the top 1000 rows from your staging table(s) (ordered by the staging id) and move them to the remote staging table.
If you fall behind, a top-1000 loop keeps you from trying to post too much data in any one cross-server call.
Depending on your data, you may be able to optimize storage and reduce log churn by not copying all columns to your staging table, just the staging id and the primary key of the source table, and pretending that whatever data is in the source record at the time you post it to the OLAP database accurately reflects the data at the time the record was staged. It won't be 100% accurate on your OLAP table at all times, but it will be accurate eventually.
I cannot overemphasize that you need to accommodate downtime in your design -- unless you can live with data loss or just wrong data. Even reliable connections are not 100% reliable.
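The trigger-plus-staging idea above could be sketched as follows; every name here (dbo.Sales, SalesId, dbo.SalesStage) is a placeholder:

```sql
-- Staging table on the source database: identity for ordering batches,
-- source PK, and an Insert/Update/Delete flag, as described above.
CREATE TABLE dbo.SalesStage (
    StageId bigint IDENTITY PRIMARY KEY,
    SalesId int     NOT NULL,
    Op      char(1) NOT NULL   -- 'I', 'U' or 'D'
);

-- AFTER trigger recording each change to the source table.
CREATE TRIGGER trg_Sales_Stage
ON dbo.Sales
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserted rows: 'U' when a matching deleted set exists (an update),
    -- otherwise 'I'.
    INSERT dbo.SalesStage (SalesId, Op)
    SELECT i.SalesId,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted AS i;

    -- Pure deletes: rows in deleted with no counterpart in inserted.
    INSERT dbo.SalesStage (SalesId, Op)
    SELECT d.SalesId, 'D'
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.SalesId = d.SalesId);
END;
```

The shipping job then repeatedly takes SELECT TOP (1000) ... FROM dbo.SalesStage ORDER BY StageId, posts those rows across the linked server, and deletes them from the staging table once the remote update commits.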

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so that the data with one group key is totally independent from the data with other group keys. Reports are built using only one group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively via parallel processes which add data into different groups, but I don't need realtime reports. One update per 30 minutes is fine.
Problem:
There are about 4 gigs of data already, and I found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I was thinking about this is:
Splitting the one database into several databases, one database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table) and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes (can I actually do that, or do I have to write custom code?).
I don't want to change the database to NoSQL because I would have to rewrite all the sql queries, and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but I might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process -- a db function (or script) gets run via your job scheduler of choice -- that refreshes the data every N minutes.
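Point 1 could look like this sketch (table and column names are made up; on PostgreSQL 9.3+ a materialized view with REFRESH MATERIALIZED VIEW would be an alternative):

```sql
-- Precomputed summary table, one row per group key and day.
CREATE TABLE sales_summary (
    group_key integer NOT NULL,
    sale_day  date    NOT NULL,
    n_sales   bigint  NOT NULL,
    total     numeric NOT NULL,
    PRIMARY KEY (group_key, sale_day)
);

-- Refresh function: rebuild the aggregates from the detail table.
CREATE OR REPLACE FUNCTION refresh_sales_summary() RETURNS void AS $$
BEGIN
    TRUNCATE sales_summary;
    INSERT INTO sales_summary (group_key, sale_day, n_sales, total)
    SELECT group_key, sale_date::date, count(*), sum(amount)
    FROM sales
    GROUP BY group_key, sale_date::date;
END;
$$ LANGUAGE plpgsql;

-- Scheduled via cron / pgAgent every 30 minutes:
--   SELECT refresh_sales_summary();
```

Reports then read sales_summary with the usual WHERE group_key = X, and the 30-minute staleness matches the stated requirement.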
2). Regarding replication, if you are using Streaming Replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only) for reporting.
Tune the report queries. Use EXPLAIN. Avoid procedures when you can do it in pure sql.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade the postgres version.
Run VACUUM regularly.
Out of 4, only 1 will require significant changes in the application.

Copy Multiple Tables into ONE Table (From Multiple Databases)

I've got multiple identical databases (distributed on several servers) and need to gather them to one single point to do data mining, etc.
The idea is to take Table1, Table2, ..., TableN from each database and merge them and put the result into one single big database.
To be able to write queries, and to know from which database each row came from we will add a single column DatabaseID to target table, describing where the row came from.
Editing the source tables is not an option, it belongs to some proprietary software.
We've got ~40 servers, ~170 databases and need to copy ~40 tables.
Now, how should we implement this given that it should be:
Easy to setup
Easy to maintain
Preferably easy to adjust if database schema changes
Reliable, logging/alarm if something fails
Not too hard to add more tables to copy
We've looked into SSIS, but it seemed that we would have to add each table as a source/transformation/destination. I'm guessing it would also be quite tied to the database schema. Right?
Another option would be to use SQL Server Replication, but I don't see how to add the DatabaseID column to each table. It seems it's only possible to copy data, not modify it.
Maybe we could copy all the data into separate databases, and then to run a local job on the target server to merge the tables?
It also seems like a lot of work if we'd need to add more tables to copy, as we'd have to redistribute new publications for each database (manual work?).
Last option (?) is to write a custom application to our needs. Bigger time investment, but it'd at least do precisely what we'd like.
To make it worse... we're using Microsoft SQL Server 2000.
We will upgrade to SQL Server 2008 R2 within 6 months, but we'd like the project to be usable sooner.
Let me know what you guys think!
UPDATE 20110721
We ended up with an F# program opening a connection to the SQL Server where we would like the aggregated databases. From there we query the 40 linked SQL Servers to fetch all rows (but not all columns) from some tables, and add an extra column to each row to say which DatabaseID it came from.
Configuration of servers to fetch from, which tables and which columns, is a combination of text file configuration and hard coded values (heh :D).
It's not super fast (sequential fetching so far) but it's absolutely manageable, and the data processing we do afterwards takes far longer time.
Future improvements could be to:
improve error handling if it turns out to be a problem (if a server isn't online, etc).
implement parallel fetching, to reduce the total amount of time to finish fetching.
figure out if it's enough to fetch only some of the rows, like only what's been added/updated.
All in all it turned out to be quite simple, no dependencies to other products, and it works well in practice.
Nothing fancy, but couldn't you do something like
TRUNCATE TABLE dbo.Merged
INSERT INTO dbo.Merged
SELECT [DatabaseID] = 'Database1', * FROM ServerA.dbo.Table
UNION ALL SELECT [DatabaseID] = 'Database2', * FROM ServerB.dbo.Table
...
UNION ALL SELECT [DatabaseID] = 'DatabaseX', * FROM ServerX.dbo.Table
Advantages
Easy to setup
Easy to maintain
Easy to adjust
Easy to add more tables
Disadvantages
Performance
No reliable logging/alerting
We had a similar requirement, but took a different approach. We first created a central database to collect the data. Then we created an inventory table to store the list of target servers/databases. Then we wrote a small vb.net based CLR procedure which takes the SQL query text, the target SQL instance name and the target table which will store the data (this eliminates the need to set up a linked server when new targets are added). It also adds two additional columns to the result set: the target server name and the timestamp when the data was captured.
Then we set up a Service Broker queue/service and pushed the list of target servers to interrogate.
The above CLR procedure is wrapped in another procedure which dequeues the message and executes the SQL on the target server provided. The wrapper procedure is then configured as the activation procedure for the queue.
With this we are able to achieve a degree of parallelism in capturing the data.
Advantages :
Easy to set up, easy to manage (add/remove targets)
Same framework works for multiple queries
Logging tables to check for failed queries
Works independently of each target, so if one of the targets fails to respond, the others still continue
The workflow can be paused gracefully by disabling the queue (for maintenance on the central server) and collection resumed by re-enabling it
Disadvantages:
Requires a good understanding of Service Broker.
Poison messages should be handled properly.
Please let me know if it helps.
