Transactional replication from Oracle to SQL Server

We have an Oracle OLTP system and a SQL Server reporting solution. We run nightly stored procedures to extract the data over a linked server, but it is very slow as it is transferring millions of records. It would be great to have some transactional replication to get this data into our SQL Server environment in a near real-time manner for reporting purposes. Has anyone tried this? Without buying an expensive piece of software, what would be the best bet?
We've looked into Apache Kafka to capture the changes and apply them to another database but it seems like a high maintenance approach.
If we switched on CDC on the Oracle side, could we write a stored procedure to query the CDC table and apply the changes?
Is there a simpler approach that we've overlooked?
We only have read access to the Oracle server but it is managed by a third party that could set up something on our behalf.
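For the CDC idea in particular, here is a minimal sketch of what the SQL Server side of such a pull could look like, assuming the third party exposes an Oracle change table over the existing linked server; all of the names (ORA_LINK, CDC_USER.CHG_ORDERS, CHANGE_SCN, the staging and reporting tables) are illustrative, not an actual API:

    -- Land the delta locally; OPENQUERY pushes the filter to Oracle instead of
    -- streaming the whole table across the linked server.
    INSERT INTO staging.OrdersChanges (ORDER_ID, STATUS, AMOUNT, CHANGE_SCN)
    SELECT ORDER_ID, STATUS, AMOUNT, CHANGE_SCN
    FROM OPENQUERY(ORA_LINK,
        'SELECT ORDER_ID, STATUS, AMOUNT, CHANGE_SCN
           FROM CDC_USER.CHG_ORDERS
          WHERE CHANGE_SCN > 1234567');  -- in practice inject the stored watermark via dynamic SQL

    -- Apply the delta to the reporting table.
    MERGE dbo.Orders AS tgt
    USING staging.OrdersChanges AS src
       ON tgt.ORDER_ID = src.ORDER_ID
    WHEN MATCHED THEN
        UPDATE SET tgt.STATUS = src.STATUS, tgt.AMOUNT = src.AMOUNT
    WHEN NOT MATCHED THEN
        INSERT (ORDER_ID, STATUS, AMOUNT) VALUES (src.ORDER_ID, src.STATUS, src.AMOUNT);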

Related

Advice on Azure platform to host Data Warehouse

I am a Data Warehouse developer currently looking into using the Azure platform to host a new Data Warehouse.
My experience is with on-premises servers hosting standard SQL Server databases, one for the staging database and one for the Data Warehouse. Typically I would use a combination of SSIS and stored procedures running in a scheduled SQL Server Agent job for the ETL.
How can I replicate this kind of setup within Azure?
The storage size will be less than 1 TB, so could I just use an Azure SQL Database rather than Azure SQL Data Warehouse?
If so would I need separate databases for staging and the data warehouse using the elastic pool option?
The data that I will be loading into staging will all be on-premises. Will SSIS still be suitable for loading into Azure, or would Azure Data Factory be a better fit?
Any help at all would be greatly appreciated! Thanks.
Leon has lots of good information there. But from a Data Warehouse perspective, I wouldn't use Data Sync for ETL purposes (it is listed as "not preferred" for ETL in the "When to use Data Sync" section of the Data Sync link Leon provided).
For a DW, Azure DB is a good option. Azure SQL Data Warehouse (known as Azure Synapse Analytics nowadays) is a heavy-duty beast for handling a DW. Are you really sure you need that kind of system with < 1 TB of data? I'd personally leave Azure Synapse for now and try Azure DB first. It's a LOT cheaper and you can upgrade later if necessary.
One thing to note about Azure DB though: Azure DB doesn't support cross-database queries. That's not a deal breaker, as everything can be handled in the same database. I personally use a schema to differentiate staging from the DW (and of course I use other schemas in the DW as well). It's not very difficult to use separate databases, of course, but the boundary between them is a lot harder in Azure DB than in on-premises SQL Server or other Azure solutions (Managed Instance, for example).
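As a tiny illustration of that schema approach (the schema and table names here are just examples, not a prescription), staging and the warehouse can live side by side in one Azure SQL database:

    -- One database, separate schemas instead of separate databases
    CREATE SCHEMA stg;   -- landing / staging area
    GO
    CREATE SCHEMA dw;    -- warehouse dimensions and facts
    GO
    CREATE TABLE stg.Customer (CustomerId INT, Name NVARCHAR(200), LoadDate DATETIME2);
    CREATE TABLE dw.DimCustomer (
        CustomerKey INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId  INT,
        Name        NVARCHAR(200));

    -- ETL procs can then reference both areas with ordinary two-part names
    INSERT INTO dw.DimCustomer (CustomerId, Name)
    SELECT s.CustomerId, s.Name
    FROM stg.Customer AS s
    WHERE NOT EXISTS (SELECT 1 FROM dw.DimCustomer AS d WHERE d.CustomerId = s.CustomerId);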
SSIS is still an option, but the question is, what do you use to run the packages? There are options like:
continue running them from on-premises (all the hard work is still done in the cloud)
rent a VM with SQL Server from Azure, deploy the packages to the VM and run them from VM
use Data Factory to run the SSIS packages
None of those is a perfect solution for every use case. The first two options come with quite a heavy cost if running SSIS is the only thing you need them for. Using Data Factory to run SSIS is a bit cumbersome at the moment, but it's an option anyway.
Data Factory itself is a good option as well (I haven't personally tried it, but I have heard good things about it). If you use Data Factory to run your SSIS, why not start using Data Factory without SSIS packages in the first place? Of course Data Factory has some limitations compared to SSIS, which might be the reason to stick with SSIS, but if your packages are simple enough, why not give Data Factory a try?
I would suggest using Azure SQL Database. It provides many price tiers with different amounts of storage, so you can select the most suitable tier for you. Azure SQL Database also supports scaling up/down based on usage.
Ref: Service tiers in the DTU-based purchase model
And as you said, the data that you will be loading into staging will all be on-premises.
Azure SQL Database has a feature, Data Sync, that can help you do that:
Data Sync is useful in cases where data needs to be kept updated across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync:
Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who are considering moving to the cloud and would like to put some of their application in Azure.
Distributed Applications: In many cases, it's beneficial to separate different workloads across different databases. For example, if you have a large production database, but you also need to run a reporting or analytics workload on this data, it's helpful to have a second database for this additional workload. This approach minimizes the performance impact on your production workload. You can use Data Sync to keep these two databases synchronized.
Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily keep databases in regions around the world synchronized.
When you create the SQL database, you can migrate the schema or data to Azure with many tools, such as the Data Migration Assistant (DMA).
Then set up SQL Data Sync between Azure SQL Database and SQL Server on-premises; it can sync the data automatically as often as every 5 minutes.
Hope this helps.
If you want to start on the less expensive options in Azure, go with a general purpose SQL database and an Azure Data Factory pipeline with a few activities.
Dynamic Resource Scaling ETL
You can scale up the database by issuing an ALTER DATABASE statement and then move on to your stored procedure based ETL. I would even use a "master" proc to call the dimension and fact procs to control the execution flow. Then scale down the database with another ALTER DATABASE statement. I even created my own stored proc to issue these scaling statements.
You also cannot predict when the scaling will be completed, so I have a wait activity. You could be a little more nerdy with a loop that checks the service objective property and then proceeds when it is complete. But it was just easier to wait for 10 minutes. I have only been burnt a couple times when the scaling took longer.
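A rough sketch of those scaling statements and the completion check (the database name and service objectives are placeholders, and the polling assumes you run this from the master database of the logical server):

    -- Scale up before the heavy ETL window
    ALTER DATABASE [MyDw] MODIFY (SERVICE_OBJECTIVE = 'S6');

    -- Optional: poll instead of a fixed wait; the scaling operation is asynchronous
    WHILE CAST(DATABASEPROPERTYEX('MyDw', 'ServiceObjective') AS sysname) <> 'S6'
        WAITFOR DELAY '00:01:00';

    -- ... run the master ETL proc here ...

    -- Scale back down afterwards
    ALTER DATABASE [MyDw] MODIFY (SERVICE_OBJECTIVE = 'S1');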
Data Pipeline Activities:
Scale up, proceed if successful
Wait about 10 minutes, proceed always
Execute the ETL, proceed always
Scale down
Elastic Query
You can query across databases with vertically partitioned Elastic Query. Performance isn't great, and they don't recommend it for ETL, but it will work. To improve performance, try dumping any large table you need into a temp table and then transforming the data locally.
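If you do go the Elastic Query route, the wiring is roughly this (server, database, credential and table names are all placeholders):

    -- In the database that runs the queries
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    CREATE DATABASE SCOPED CREDENTIAL StagingCred
        WITH IDENTITY = 'etl_user', SECRET = '<password>';

    CREATE EXTERNAL DATA SOURCE StagingDb WITH (
        TYPE = RDBMS,
        LOCATION = 'myserver.database.windows.net',
        DATABASE_NAME = 'Staging',
        CREDENTIAL = StagingCred);

    -- Mirrors the remote table's definition; no data is stored locally
    CREATE EXTERNAL TABLE stg.Customer (
        CustomerId INT,
        Name NVARCHAR(200))
    WITH (DATA_SOURCE = StagingDb);

    -- As suggested above: pull it into a temp table before transforming
    SELECT CustomerId, Name
    INTO #Customer
    FROM stg.Customer;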

Snapshot Replication or Backup/Restore

A SQL Server that serves as a DW for reporting purposes (OLAP) is also used directly by users to perform ad-hoc queries. The user queries add a lot of extra load on this server because of their number and concurrency, and since this is not its primary role, I am considering replicating the relatively big database to another machine and having users execute their queries on that machine to offload the DW.
This replication should be done daily.
My question is: what is the faster solution for this - Doing a Backup/Restore of the original DB or setting up Snapshot Replication ?
Is there any other approach to this problem - any suggestions?

SQL Server to Hadoop Replication

Is there a way to replicate data from SQL Server to Hadoop, similar to native transactional replication between two SQL Server databases?
I am not sure if Microsoft has devised such a mechanism, wherein the incremental data can be replicated from SQL Server to Hadoop in real time from the SQL Server transaction logs.
Any response will be appreciated.
I am trying to do the same thing with CDC.
You can try Talend's native CDC approach.
You can download the Hortonworks – Talend sandbox from http://www.talend.com/talend-big-data-sandbox
I don't know of a feature similar to what you're looking for but there are a few things you should consider:
If you're using plain Hadoop (HDFS + M/R) you should copy big chunks of data (64 MB/128 MB/256 MB - generally speaking, the size of your HDFS blocks).
If you want real-time data insertion into Hadoop you should consider HBase (and that complicates things both on the IT level and the programming level).
In addition to data insertion, do you also want to replicate changes to data (i.e. updates and deletes)? If so, your only option would be HBase.
I would try to use CDC + code (either in CLR stored procedures or in SSIS) to implement such a mechanism.
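For reference, the CDC part on the SQL Server side looks roughly like this (the table name is just an example); pushing the result into Hadoop/HBase would still be your own code:

    -- Enable CDC on the database and on the table you want to track
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;

    -- Read the changes captured since the last extraction
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');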

Upgrade SQL Server 2000 to 2008 R2 with replication

I have been looking into this project for a side-by-side upgrade solution. The most widely suggested/used solution is to do a full backup of the SQL Server 2000 database and restore it on SQL Server 2008 with NORECOVERY. Then restore the subsequent transaction log backups with NORECOVERY. When we are ready to switch, change SQL Server 2000 to read-only mode, back up the tail-log and restore it on SQL Server 2008 with RECOVERY. Then bring SQL Server 2008 online.
But, can't the upgrade be done with transactional replication where SQL Server 2000 is the publisher and SQL Server 2008 is the subscriber. Script all objects such as logins, indexes, etc and apply to SQL Server 2008. When we are ready to switch, we will stop replication, delete all replication jobs, and switch all apps to connect to SQL Server 2008. I haven't found anyone that suggests this method. Is there anything wrong with it?
The method of data migration you describe is possible to perform using SQL Server Replication.
There is nothing wrong with this method or any other data migration method for that matter, so long as the choice you decide upon addresses the specific requirements of your project/application platform.
That said, the method you describe is certainly more technically involved in both the configuration and the implementation of the actual migration steps. If you can accept downtime, a simple backup and restore process is certainly going to be much more straightforward. Log shipping would also be another simpler migration method.
So far, you know that the replication method could work in theory. Now is the time to build out a working solution in test in order to validate your data migration strategy and to practice the implementation process.
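If it helps when building out that test environment, the publication side of a minimal transactional replication setup looks roughly like this (database, publication, article and server names are placeholders, and a distributor is assumed to be configured already):

    -- On the SQL Server 2000 publisher
    EXEC sp_replicationdboption @dbname = N'AppDb', @optname = N'publish', @value = N'true';

    EXEC sp_addpublication @publication = N'AppDb_Pub', @status = N'active';
    EXEC sp_addpublication_snapshot @publication = N'AppDb_Pub';

    -- One article per table to migrate
    EXEC sp_addarticle @publication = N'AppDb_Pub', @article = N'Orders',
         @source_owner = N'dbo', @source_object = N'Orders';

    -- Push subscription pointing at the SQL Server 2008 instance
    EXEC sp_addsubscription @publication = N'AppDb_Pub',
         @subscriber = N'SQL2008SRV', @destination_db = N'AppDb',
         @subscription_type = N'push';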
If you aren't replicating otherwise, creating a replication subscription will change your schema and a few settings.
For example, you may end up with GUIDs generated for all your rows just to facilitate the replication.
Caution - transactional replication will turn off all IDENTITY columns at the subscriber (the transactional replication SPs actually depend on this fact, as they insert into the IDENTITY columns without first specifying IDENTITY_INSERT ON). I can only confirm this is the case when the subscriber is SQL 2000 as well - perhaps the subscriber on 2008 will behave differently.
For this reason, transactional replication with SQL 2K doesn't really give you a hot standby. We had to do a fair bit of SQL tweaking (re-instating the IDENTITY columns & re-writing the replication SPs with IDENTITY_INSERT wrappers) to get ourselves a situation where the subscriber actually works as a hot standby, ready to have applications pointed at it. But it certainly wouldn't work out of the box =)
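To give an idea of that tweaking, the subscriber-side insert procedures ended up wrapped roughly like this (procedure, table and column names are illustrative):

    -- Custom subscriber insert proc: preserve the publisher's IDENTITY values
    -- instead of relying on the column having lost its IDENTITY property.
    CREATE PROCEDURE dbo.sp_MSins_Orders
        @OrderId INT, @Status NVARCHAR(20), @Amount MONEY
    AS
    BEGIN
        SET IDENTITY_INSERT dbo.Orders ON;
        INSERT INTO dbo.Orders (OrderId, Status, Amount)
        VALUES (@OrderId, @Status, @Amount);
        SET IDENTITY_INSERT dbo.Orders OFF;
    END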
Yes, it will work, provided that you transfer the other objects over.

How to synchronize databases in different servers in SQL Server 2008?

I have 2 databases that have the same structure, one on a local machine and one on the company's server. Every determined amount of time, the data from the local DB should be synchronized to the server DB.
I have a general idea of how to do this - create a script that somehow "merges" the information that is not on the server DB, then make this script run as a scheduled job on the server. However, my problem lies in the fact that I am not very experienced with this.
Does SQL Server Management Studio provide an easy way to do this (some kind of wizard) and generate this kind of script? Is this something I'll have to build from scratch?
I've done some basic Google searches and came across the term 'Replication', but I don't fully understand it. I would rather hear some input from people who have actually done this or who are good at explaining this kind of stuff.
Thanks.
Replication sounds like a good option for this, but there would be some overhead (not technical overhead, but the knowledge needed to support it).
Another SQL Server option is SSIS. SSIS provides graphical tools to design what you're trying to do. The SSIS package can also run SQL statements, if appropriate. An SSIS package can be started, and therefore scheduled, from a SQL Server job.
You should consider the complexity of the synchronization rules when choosing your solution. For example, would it be difficult to resolve conflicts, such as a duplicate key, when merging the data? A SQL script may be easy to create if the rules are simple, but complex conflict rules may be more difficult to implement in a script (or in replication).
SQL Server Management Studio unfortunately doesn't offer much in this way.
You should have a serious look at some of the excellent commercial offerings out there:
Red Gate Software's SQL Compare and SQL Data Compare - excellent tools, highly recommended! You can even compare a live database against a backup from another database and synchronize the data - pretty nifty!
ApexSQL's SQL Diff and SQL Data Diff
They all cost money - but if you're serious about it, and you use them in your daily routine, they're paid for in no time at all - well worth every dime.
The only "free" option you have in SQL Server 2008 would be to create a link between the two servers and then use something like the MERGE statement (new in SQL Server 2008) to transfer the data. That doesn't work for structural changes, and it's limited only to having a live connection between the two servers.
You should definitely read up on transactional replication. It sounds like a good fit for the situation you've described. Here are a few links to get you started.
How Transactional Replication Works
How do I... Configure transactional replication between two SQL Server 2005 systems?
Performance Tuning SQL Server Transactional Replication
What you want is Peer-to-Peer Transactional Replication, which allows data to be updated at both databases yet keeps them in sync through a continuous merge of changes. This is the closest match to what you want, but it is a fairly costly option (it requires Enterprise Edition on both sites). Another option is Bidirectional Transactional Replication, but since this also requires two EE licenses, I'd say that peer-to-peer is easier to deploy for the same money.
A more budget-friendly option is Updatable Subscriptions for Transactional Replication, but updatable subscriptions are being deprecated and you'd be betting your money on a losing horse.
Another option is to use Merge Replication. And finally, for the cases when the 'local' database is quite mobile there is Sync Framework.
Note that all these options require some configuration and cooperation from the Company's server DB.
There are some excellent third-party tools out there. For me, xSQL Data Compare has always done the trick. And because the comparisons are highly configurable, it is suitable for almost every data-compare or data-synchronization scenario. Hope this helps!
