Is there a way to replicate data from SQL Server to Hadoop similar to native transaction replication between two SQL Server Databases ?
I am not sure if Microsoft has devised such mechanism wherein the incremental data can be replicated from SQL Server to HAdoop at real time from SQL Server transaction logs.
Any response will be appreciated.
Same thing i am trying to do with CDC.
You can try Teland native CDC approach.
You can download Hortonworks – Talend sandbox from
http://www.talend.com/talend-big-data-sandbox
I don't know of a feature similar to what you're looking for but there are a few things you should consider:
If you're using plain Hadoop (HDFS+M/R) you should copy big chunks of data (64mb/128mb/256mb - generally speaking, the size of your HDFS blocks).
If you want realtime data insertion into Hadoop you should consider hbase (and that complicates things both on the IT level and the programming level).
In addition to data insertion, do you also want to replicate changes to data (i.e. update, delete)? If so, your only option would be hbase.
I would try to use CDC + code (either in CLR stored procedures or in SSIS) to implement such a mechanism.
Related
There is one postgres database in rds. We need to sync it to another postgres database in rds.
I have connected one database with pg admin but not sure how to synchronise the database with the other one.
Is this a recurring task or a one time task? If it's a one time thing, you can do a dump/backup and restore it to a new instance. If this is a recurring thing, then...
Why do you want to replicate the data? What is your actual goal? If it's for scaling, you should likely use the native RDS functionality for that. If it's for testing, then you will likely want to sanitize data as part of the copy.
As smac2020 said in their answer, AWS provides a migration service you can use for several use cases, but if this is a one-time thing, that is likely overkill. Just do a backup and restore that to another server.
You can also leverage change-data-capture to replicate data out, but that will likely take a lot to get working well and will only be worth it in certain circumstances. We leverage CDC to copy data to our warehouse, for example. Here is an example that streams data to Kinesis. Here is another example that streams to Kinesis using the migration service.
We have an Oracle OLTP system and a SQL Server reporting solution. We run nightly stored procedures to extract the data using linked server but it it very slow as it is transferring millions of records. It would be great to have some transactional replication to get this data into our SQL Server environment in a near real time manner for reporting purposes. Has anyone tried this? Without buying an expensive piece of software what would be the best bet?
We've looked into Apache Kafka to capture the changes and apply them to another database but it seems like a high maintenance approach.
If we switched on CDC on the Oracle side could we write a stored procedure to query the CDC table and make the changes?
Is there a simpler approach that we've overlooked?
We only have read access to the Oracle server but it is managed by a third party that could set up something on our behalf.
I am a Data Warehouse developer currently looking into using the Azure platform to host a new Data Warehouse.
My experience is with using on premise servers hosting standard SQL Server Databases, one for the staging database and one for the Data Warehouse. Typically I would use a combination of SSIS and stored procedures running in a scheduled SQL server agent job for the ETL.
How can I replicate this kind of setup within Azure?
The storage size will be less than 1TB so could I just use Azure SQL Server Database over Azure SQL Data Warehouse?
If so would I need separate databases for staging and the data warehouse using the elastic pool option?
The data that I will be loading into staging will all be on premise. Will SSIS still be suitable for loading to Azure or will Azure Data Factory be a better fit?
Any help at all would be greatly appreciated! Thanks.
Leon has lots of good information there. But from a Data Warehouse perspective, I wouldn't use Data Sync for ETL purposes (mensioned as "not preferred" in the link Leon provided, Data Sync, in the list "When to use Data Sync").
For DW, Azure DB is a good option. Azure SQL Data Warehouse (known as Azure Synapse Analytics nowadays) is a heavy duty beast for handling DW. Are you really sure you need this kind of system with < 1Tb data? I'd personnally leave Azure Synaptics for now, and tried with Azure DB first. It's a LOT cheaper and you can upgrade later if necessary.
One thing to note about Azure DB though: Azure DB doesn't support queries over databases. That's not a deal breaker though, everything can be handled in the same database. I personally use a schema to differentiate staging from the DW (and of course I use other schemas in the DW as well). It's not very difficult to use separate databases of course, but the border between them is a lot deeper in Azure DB than on-premise SQL Server or other Azure solutions (Managed Instance for example).
SSIS is still an option, but the problem is, what you use to run the packages? There are options like:
continue running them from on-premise (all the hard work is still done in the cloud)
rent a VM with SQL Server from Azure, deploy the packages to the VM and run them from VM
use Data Factory to run the SSIS packages
None of those are a perfect solution for every use case. First two options come with quite a heavy cost, if running SSIS is the only thing you need them for. Using Data Factory to run SSIS is a bit cumbersome at the moment, but it's an option anyway.
Data Factory itself is a good option as well (I haven't personally tried it, but I have heard good things about it). If you use Data Factory to run your SSIS, why not start using Data Factory without SSIS packages in the first place? Of course Data Factory has some limitations compared to SSIS which might be the reason, but if your SSIS packages are simple enough, why not give Data Factory a try.
I would suggest you using Azure SQL database. It provides many price tier with difference storage for you. You can select the most suitable price tier for you. Azure SQL database also support scale up/down base on the usage.
Ref: Service tiers in the DTU-based purchase model
And as you said, the data that I will be loading into staging will all be on premise.
Azure SQL database has the feature Data Sync can help you do that:
Data Sync is useful in cases where data needs to be kept updated across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync:
Hybrid Data Synchronization: With Data Sync, you can keep data
synchronized between your on-premises databases and Azure SQL
databases to enable hybrid applications. This capability may appeal
to customers who are considering moving to the cloud and would like
to put some of their application in Azure.
Distributed Applications: In many cases, it's beneficial to separate
different workloads across different databases. For example, if you
have a large production database, but you also need to run a
reporting or analytics workload on this data, it's helpful to have a
second database for this additional workload. This approach minimizes
the performance impact on your production workload. You can use Data
Sync to keep these two databases synchronized.
Globally Distributed Applications: Many businesses span several
regions and even several countries/regions. To minimize network
latency, it's best to have your data in a region close to you. With
Data Sync, you can easily keep databases in regions around the world
synchronized.
When you create the SQL database, you can migrate the schema or data to Azure with many tools, such as Data Migration Assistant(DMA).
Then Set up SQL Data Sync between Azure SQL Database and SQL Server on-premises, it will help sync the data auto every 5 mins.
Hope this helps.
If you want to start on the less expensive options in Azure, go with a general purpose SQL database and an Azure Data Factory pipeline with a few activities.
Dynamic Resource Scaling ETL
You can scale up the database by issuing an alter database statement and then move onto your stored proc based ETL. I would even use a "master" proc to call the dimension and fact proc's to control the execution flow. Then scale down the database with another alter database statement. I even created my own stored proc to issue these scaling statements.
You also cannot predict when the scaling will be completed, so I have a wait activity. You could be a little more nerdy with a loop that checks the service objective property and then proceeds when it is complete. But it was just easier to wait for 10 minutes. I have only been burnt a couple times when the scaling took longer.
Data Pipeline Activities:
Scale up, proceed if successful
Wait about 10 minutes, proceed always
Execute the ETL, proceed always
Scale down
Elastic Query
You can query across databases with vertical partition Elastic Query. Performance isn't great, and they don't recommend it for ETL, but it will work. To improve performance try dumping any large table you need into a temp table and then transform the data locally.
I have 2 servers. I need to copy some columns from 4 different tables from server 1 into the corresponding (empty) tables in server 2.
So basically, it's about replicating data from one table to another. How is this done best (and easiest)? Also, how do I make sure that the copied/replicated data is updated at the same frequency as the source (which runs completely fine and automatically)?
I want to avoid using Linked Server.
How is this done best (and easiest)?
For a one time replication consider a SQL Server Import and Export Wizard. This approach can also be scheduled by saving a final package and schedule it by SQL Agent
Example: Simple way to import data into SQL Server
For a continuous, low latency data syncronization - SQL Server Transactional Replication.
Further read: Tutorial: Configure replication between two fully connected servers (transactional)
Worth to mention, that transactional replication is not the easiest topic, however, it fits quite good to a requirement.
I have 2 databases that have the same structure, one on a local machine and one on the company's server. Every determined amount of time, the data from the local DB should be synchronized to the server DB.
I have a general idea on how to do this - create a script that somehow "merges" the information that is not on the server DB, then make this script run as a scheduled job for the server. However, my problem lies in the fact that I am not very well experienced with this.
Does SQL Server Management Studio provide an easy way to do this (some kind of wizard) and generates this kind of script? Is this something I'll have to build from scratch?
I've done some basic google searches and came across the term 'Replication' but I don't fully understand it. I would rather hear some input from people who have actually done this or who are good with explaining this kind of stuff.
Thanks.
Replication sounds like a good option for this, but there would be some overhead (not technical overhead, but the knowledge need to support it).
Another SQL Server option is SSIS. SSIS provides graphical tools to design what you're trying to do. The SSIS package can also run SQL statements, if appropriate. An SSIS package can be started, and therefore scheduled, from a SQL Server job.
You should consider the complexity of the synchronization rules when choosing your solution. For example, would it be difficult to resolve conflicts, such as a duplicate key, when merging the data. A SQL script may be easy to create if the rules are simple. But, complex conflict rules may be more difficult to implement in a script (or, replication).
SQL Server Management Studio unfortunately doesn't offer much in this way.
You should have a serious look at some of the excellent commercial offerings out there:
Red Gate Software's SQL Compare and SQL Data Compare - excellent tools, highly recommended! You can even compare a live database against a backup from another database and synchronize the data - pretty nifty!
ApexSQL's SQL Diff and SQL Data Diff
They all cost money - but if you're serious about it, and you use them in your daily routine, they're paid for in no time at all - well worth every dime.
The only "free" option you have in SQL Server 2008 would be to create a link between the two servers and then use something like the MERGE statement (new in SQL Server 2008) to transfer the data. That doesn't work for structural changes, and it's limited only to having a live connection between the two servers.
You should definitely read up on transactional replication. It sounds like a good fit for the situation you've described. Here are a few links to get you started.
How Transactional Replication
Works
How do I... Configure
transactional replication between two
SQL Server 2005 systems?
Performance Tuning SQL Server
Transactional Replication
What you want is Peer-to-Peer Transactional Replication, which allows data to be updated at both databases yet keep them in sync through a contiguous merge of changes. This is the closes match to what you want, but is a fairly costly option (requires Enterprise Edition on both sites). Another option is Bidirectional Transactional Replication, but since this requires also two EE licenses, I say that peer-to-peer is easier to deploy for the same money.
A more budget friendly option is Updatable Subscriptions for Transactional Replication, but updatable subscriptions are being deprecated and you'd bet your money on a loosing horse.
Another option is to use Merge Replication. And finally, for the cases when the 'local' database is quite mobile there is Sync Framework.
Note that all these options require some configuration and cooperation from the Company's server DB.
There are some excellent third party tools out there. For me, xSQL Data Compare has always done the trick. And because the comparisons are highly modifiable it is suitable for almost every data compare or data-synchronization scenario. Hope this helps!