Advice on Azure platform to host Data Warehouse

I am a Data Warehouse developer currently looking into using the Azure platform to host a new Data Warehouse.
My experience is with on-premises servers hosting standard SQL Server databases, one for staging and one for the Data Warehouse. Typically I would use a combination of SSIS and stored procedures running in a scheduled SQL Server Agent job for the ETL.
How can I replicate this kind of setup within Azure?
The storage size will be less than 1 TB, so could I just use Azure SQL Database rather than Azure SQL Data Warehouse?
If so, would I need separate databases for staging and the data warehouse, using the elastic pool option?
The data that I will be loading into staging will all be on-premises. Will SSIS still be suitable for loading to Azure, or will Azure Data Factory be a better fit?
Any help at all would be greatly appreciated! Thanks.

Leon has lots of good information there. But from a Data Warehouse perspective, I wouldn't use Data Sync for ETL purposes (it is mentioned as "not preferred" in the Data Sync link Leon provided, in the list "When to use Data Sync").
For DW, Azure DB is a good option. Azure SQL Data Warehouse (known as Azure Synapse Analytics nowadays) is a heavy-duty beast for handling DW. Are you really sure you need this kind of system with < 1 TB of data? I'd personally leave Azure Synapse for now and try Azure DB first. It's a LOT cheaper and you can upgrade later if necessary.
One thing to note about Azure DB though: it doesn't support cross-database queries. That's not a deal breaker, because everything can be handled in the same database. I personally use a schema to separate staging from the DW (and of course I use other schemas in the DW as well). It's not very difficult to use separate databases, but the border between them is a lot deeper in Azure DB than in on-premises SQL Server or other Azure solutions (Managed Instance, for example).
SSIS is still an option, but the problem is: what do you use to run the packages? There are options like:
continue running them from on-premises (all the hard work is still done in the cloud)
rent a VM with SQL Server from Azure, deploy the packages to the VM and run them from there
use Data Factory to run the SSIS packages
None of these is a perfect solution for every use case. The first two options come with quite a heavy cost if running SSIS is the only thing you need them for. Using Data Factory to run SSIS is a bit cumbersome at the moment, but it's an option anyway.
Data Factory itself is a good option as well (I haven't personally tried it, but I have heard good things about it). If you use Data Factory to run your SSIS packages, why not start using Data Factory without SSIS in the first place? Of course Data Factory has some limitations compared to SSIS, which might be the reason, but if your SSIS packages are simple enough, why not give Data Factory a try?

I would suggest using Azure SQL Database. It offers many pricing tiers with different storage limits, so you can select the one that suits you best. Azure SQL Database also supports scaling up and down based on usage.
Ref: Service tiers in the DTU-based purchase model
And as you said, the data that you will be loading into staging will all be on-premises.
Azure SQL Database has a feature, Data Sync, that can help with that:
Data Sync is useful in cases where data needs to be kept updated across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync:
Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who are considering moving to the cloud and would like to put some of their application in Azure.
Distributed Applications: In many cases, it's beneficial to separate different workloads across different databases. For example, if you have a large production database, but you also need to run a reporting or analytics workload on this data, it's helpful to have a second database for this additional workload. This approach minimizes the performance impact on your production workload. You can use Data Sync to keep these two databases synchronized.
Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily keep databases in regions around the world synchronized.
When you create the SQL database, you can migrate the schema or data to Azure with one of many tools, such as Data Migration Assistant (DMA).
Then set up SQL Data Sync between Azure SQL Database and the on-premises SQL Server; it will sync the data automatically every 5 minutes.
Hope this helps.

If you want to start on the less expensive options in Azure, go with a general purpose SQL database and an Azure Data Factory pipeline with a few activities.
Dynamic Resource Scaling ETL
You can scale up the database by issuing an ALTER DATABASE statement and then move on to your stored-proc-based ETL. I would even use a "master" proc to call the dimension and fact procs to control the execution flow. Then scale down the database with another ALTER DATABASE statement. I even created my own stored proc to issue these scaling statements.
You also cannot predict when the scaling will be completed, so I have a wait activity. You could be a little more nerdy with a loop that checks the service objective property and then proceeds when it is complete. But it was just easier to wait for 10 minutes. I have only been burnt a couple times when the scaling took longer.
Data Pipeline Activities:
Scale up, proceed if successful
Wait about 10 minutes, proceed always
Execute the ETL, proceed always
Scale down
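To make the "more nerdy" variant concrete, here is a minimal Python sketch of the four activities above, with a polling loop in place of the fixed 10-minute wait. The `execute` and `get_current` callables, the database name `MyDw`, the proc name `dbo.usp_LoadWarehouse` and the tiers `S6`/`S1` are all placeholders I made up for illustration. In practice `execute` would send T-SQL through whatever client you use, and `get_current` would run the query built by `objective_query` and return its result.

```python
import time
from typing import Callable


def scale_statement(database: str, service_objective: str) -> str:
    """Build the T-SQL that asks Azure SQL Database to change its tier."""
    return (
        f"ALTER DATABASE [{database}] "
        f"MODIFY (SERVICE_OBJECTIVE = '{service_objective}');"
    )


def objective_query(database: str) -> str:
    """Build the T-SQL that reports the tier the database is currently on."""
    return f"SELECT DATABASEPROPERTYEX('{database}', 'ServiceObjective');"


def wait_for_objective(
    get_current: Callable[[], str],
    target: str,
    timeout_s: float = 1800.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the database reports the target tier; False on timeout."""
    waited = 0.0
    while True:
        if get_current() == target:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


def run_etl(
    execute: Callable[[str], None],
    get_current: Callable[[], str],
    database: str = "MyDw",                  # hypothetical name
    etl_proc: str = "dbo.usp_LoadWarehouse",  # hypothetical "master" proc
    high: str = "S6",                         # hypothetical tiers
    low: str = "S1",
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Scale up, wait until the new tier is active, run the ETL, scale down."""
    execute(scale_statement(database, high))
    try:
        if not wait_for_objective(get_current, high, sleep=sleep):
            raise TimeoutError(f"{database} did not reach {high} in time")
        execute(f"EXEC {etl_proc};")
    finally:
        # Mirrors "proceed always": scale down even if the ETL failed.
        execute(scale_statement(database, low))
```

The `finally` block plays the role of the "proceed always" dependency, so a failed ETL run does not leave the database on the expensive tier.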
Elastic Query
You can query across databases with vertical partition Elastic Query. Performance isn't great, and they don't recommend it for ETL, but it will work. To improve performance try dumping any large table you need into a temp table and then transform the data locally.

Related

How to replicate Production MSSQL DB into Test db on a daily basis

I'm using an RDS SQL Server database for my live environment.
I need to fully replicate the database into another RDS database at a fixed frequency (maybe daily). The goal is to keep the test environment provisioned with the latest real data (qualitatively and quantitatively) so that tests are meaningful.
How can I get this done in AWS?
NOTE: the DB footprint is about 20 GB.

You can create a SQL Server Integration Services project that will copy the entire database from one environment to the other. You can schedule it as a Job. You should investigate using the Transfer Database Task.

Your tests should not use production data. Not only is it a waste of processing and money, there is also no justification for it, and it is bad practice. You should know what your production data looks like and keep a test table with representative data, updating it only when there is a change to how you process the data.

Load balancer and multiple instance of database design

The current single application server can handle about 5000 concurrent requests. However, the user base will be in the millions, and I may need two application servers to handle the requests.
So the design is to add a load balancer, in the hope that it will handle over 10000 concurrent requests. However, each user's data is stored in one single database. If the design has two or more servers, should I do the following?
Have two database instances
Real-time sync between the two databases
Is this correct?
However, if so, will the sync process lower the performance of the servers, since database replication seems costly?
Thank you.

You probably want to think of your service in "tiers". In this instance, you've got two tiers; the application tier and the database tier.
Typically, your application tier is going to be considerably easier to scale horizontally (i.e. by adding more application servers behind a load balancer) than your database tier.
With that in mind, the best approach is probably to overprovision your database (i.e. put it on its own, meaty server) and have your application servers all connect to that same database. Depending on the database software you're using, you could also look at using read replicas (AWS docs) to reduce the strain on your database.
You can also look at caching via Memcached / Redis to reduce the amount of load you're placing on the database.
So – tl;dr – put your DB on its own, big, server, and spread your application code across many small servers, all connecting to that same DB server.
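As a rough illustration of the caching idea mentioned above, here is a minimal cache-aside sketch in Python. A plain dict stands in for Memcached / Redis, and `load_from_db`, the key format and the TTL are assumptions of mine, not anything from the question. The point is only that repeated reads of the same key hit the cache and skip the database round trip.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Try the cache first; fall back to the database on a miss or expiry."""

    def __init__(
        self,
        load_from_db: Callable[[str], str],
        ttl_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_db
        self._ttl = ttl_s
        self._clock = clock
        # Stand-in for Redis/Memcached: key -> (expiry time, value).
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # cache hit: no database query
        value = self._load(key)          # miss or expired: query the database
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after writing to the database so stale data is not served."""
        self._store.pop(key, None)
```

Every application server behind the load balancer would use the same shared cache, so the single database only sees the misses.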

The best option could be synchronizing a standby node with data from the active node. This is cost-effective, since it can be achieved with an open-source relational database (e.g. MariaDB).
Do not store computed results and statistics that can easily be calculated at run time; this helps reduce the data size.
If historical data is not needed urgently for queries, it can be written to text files in a format that is easy to import into a database (e.g. .csv).
Data objects that are updated very often can be kept in an in-memory database as key-value pairs; use a scheduled task to batch update/insert into the relational database for persistence.
Implement retry logic for database batch update tasks to handle DB downtime or network errors.
Consider writing data to the relational database as serialized objects.
Cache configuration data in memory, loading it from the database either periodically or via an API that refreshes the changed part.
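Two of the points above (keeping hot data in memory with a scheduled batch write, and retrying the batch on failure) can be sketched together in Python. This is only a single-threaded illustration under my own assumptions: `flush_batch` is a hypothetical callable that writes a list of key/value pairs to the relational database, and a real version would catch only the driver's transient errors rather than every exception.

```python
import time
from typing import Callable, Dict, List, Tuple

Batch = List[Tuple[str, str]]


def with_retry(
    task: Callable[[], None],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run task(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            task()
            return
        except Exception:
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the error
            sleep(base_delay_s * 2 ** attempt)


class WriteBehindBuffer:
    """Hold frequent updates in memory and persist them in batches."""

    def __init__(self, flush_batch: Callable[[Batch], None]) -> None:
        self._pending: Dict[str, str] = {}
        self._flush_batch = flush_batch

    def put(self, key: str, value: str) -> None:
        # Later writes to the same key replace earlier ones before the flush.
        self._pending[key] = value

    def flush(self, **retry_options) -> int:
        """Write pending items in one batch; meant for a scheduled task."""
        if not self._pending:
            return 0
        batch = list(self._pending.items())
        with_retry(lambda: self._flush_batch(batch), **retry_options)
        # Reached only if the batch was written; otherwise items stay queued.
        self._pending.clear()
        return len(batch)
```

Because the buffer is cleared only after a successful write, a flush that exhausts its retries leaves the data in memory for the next scheduled run.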

copy Azure SQL database (PaaS) to IaaS (SQL server on VM)

Is it possible to use CREATE DATABASE [] AS COPY OF [] to copy a database hosted as Azure SQL Database (PaaS) to IaaS (SQL Server on a VM)?
Can you recommend an alternative to Import/Export that limits the downtime of such a transition?
The reason for this migration is the restriction on cross-database queries in PaaS mode, which complicates a one-time migration to the new database used by a newer version of the application.

The answer depends on whether you want to copy the database schema, data, or both.
As Jaxidian said, ApexSQL tools can do the job, but as far as I know Data Diff only synchronizes database data, while Diff synchronizes the schema.
Here is an article describing the process of copying database data:
https://solutioncenter.apexsql.com/how-to-automatically-synchronize-the-data-in-two-sql-server-databases-on-a-schedule/
If you want to copy both schema and data, the process is described here:
https://solutioncenter.apexsql.com/how-to-automatically-compare-and-synchronize-multiple-databases-on-different-sql-server-instances/

There are lots of tools available that can accomplish this. Which one is best for you depends on your needs. However, the "Copy" feature in the Azure Portal will not accomplish this on its own, although it can be one part of the approach you settle on.
I'll make the following assumptions:
You have an always-on 24/7 production load, so there are no regular nightly/weekly/monthly maintenance windows
You can schedule a maintenance window but you wish to keep it as small as possible
You can easily configure your applications' connection strings
Your database isn't huge. Gigabytes is fine.
Your database is mostly static data (i.e. an incremental approach is much faster than a dump-and-fill)
If I were to do this today/right now, my approach would be like this (this is only one option):
Use the Copy feature to make a copy of the database that I can use as a staging area/reference point while minimizing the load on the Production database
Create a backup (bacpac file) from the copied database
Restore the bacpac file onto your IaaS-hosted SQL Server to form your base deployment
Start your maintenance window and effectively put your database into read-only mode so the data is now no longer changing (lots of strategies on how to do this whether you turn applications off, revoke permissions, etc.)
Use a tool such as ApexSQL Data Diff (Redgate and others have options) to compare data between the two databases and sync the data over to the new IaaS DB. Be careful - depending on your data needs you may have to tweak the generated scripts that sync the data.
Verify that the new DB is now indeed a duplicate copy of your old DB (ApexSQL Data Diff can also help with this - several options exist here)
Change the connection strings on your apps to point to the new DB
Turn applications back on and end your maintenance window.
So of course, if you do something like this, practice it numerous times and test the results numerous times well before your maintenance window. Get a good idea of the timing for everything, especially how long it will take for you to generate and restore the bacpac file. This is because you want to do that as late as possible before your maintenance window to minimize the time it takes to generate and run the final "Data Diff" script that you'll use. The longer that script takes, the longer your outage will be.
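For the verification step in the list above, one option besides a commercial diff tool is a quick scripted check. The following is only a sketch under my own assumptions, not a replacement for a proper data-diff tool: it takes two DB-API connections and a trusted list of table names, and compares row counts plus an order-independent digest of each table. It pulls every row into memory, so it only suits the "not huge" databases assumed above. The demo at the bottom uses in-memory SQLite purely as a stand-in for the two SQL Server databases, and the `customer` table is hypothetical.

```python
import hashlib
import sqlite3
from typing import Dict, Iterable, Tuple


def table_fingerprint(conn, table: str) -> Tuple[int, str]:
    """Return (row count, digest of the sorted rows) for one table."""
    cur = conn.cursor()
    # Table names are interpolated, so they must come from a trusted list.
    cur.execute(f"SELECT * FROM {table}")
    rows = sorted(repr(tuple(row)) for row in cur.fetchall())
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return len(rows), digest


def compare_databases(source, target, tables: Iterable[str]) -> Dict[str, bool]:
    """Map each table name to True when both sides hold identical rows."""
    return {
        table: table_fingerprint(source, table) == table_fingerprint(target, table)
        for table in tables
    }


if __name__ == "__main__":
    # Stand-in demo: two in-memory SQLite databases with the same rows
    # inserted in a different order.
    old_db = sqlite3.connect(":memory:")
    new_db = sqlite3.connect(":memory:")
    for db in (old_db, new_db):
        db.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    old_db.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "x"), (2, "y")])
    new_db.executemany("INSERT INTO customer VALUES (?, ?)", [(2, "y"), (1, "x")])
    print(compare_databases(old_db, new_db, ["customer"]))
```

Running something like this during the practice runs also gives a feel for how long verification takes inside the maintenance window.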

Options for a secondary SQL database

I have a VM in Azure running a single SQL Server instance.
I have also recently set up Power BI to refresh from this source at 1am every morning. Unfortunately, this refresh is causing performance issues: all queries and operations time out under the load.
What are my options for a secondary DB for reporting purposes? The main requirements are ease of maintenance and low cost (I don't need anything enterprise-level).
Things that come to mind:
Secondary DB on same VM. Use replication to mirror data
Another cheap VM. Use replication
Use SQL Server availability groups, connect to a read-only replica
SQL data warehouse
Can anyone provide some guidance, or ask questions that may help find my answer?
Thanks.

I think an Always On availability group with a read-only secondary replica would be best suited to your needs.
Building a separate DW for reporting purposes would be overkill, as your reporting needs are already satisfied by the current database, except for performance.
Transactional replication could help here, but it requires a lot of knowledge to set up and maintain.

I can think of several options. In general this sounds like a canonical OLTP vs. OLAP issue, or a call for a data warehouse, but since you are on a budget, let's consider low-cost options.
Assuming the databases are small (GBs, not TBs), I would separate the operational and reporting instances, either on the same machine if it is pretty beefy or, better, on two VMs so you can manage capacity separately.
I would consider replication from one instance to another.

Can you boost your VM resources during the period of the Power BI refresh only?
That's one of the key benefits of Azure - you can scale up and down and save money. How long does the refresh take? Who is using your DB at 1am?
I guess for a VM it's difficult to do this so you'd need to migrate to SQL Azure rather than a VM

SQL Server High Availability on premise - cloud

I would like to know the best way to copy an on-premises SQL Server 2008 (not R2) database to SQL Azure and keep the copies synchronized.
Think of the SQL Azure as a failover kind of structure...
Notes:
The database runs fine in SQL Azure
I have already figured out how to get the rest of the app running on Azure
Please consider suggestions of the type "Upgrade to SQL Server 2012 because of X" if the gains (reliability, efficiency, time to replicate, etc.) are worth it
I'm looking for instant replication (as fast as possible)
Yes, it will have to sync back eventually. If the on-premises deployment crashes and the cloud copy gets activated and changed, syncing back will be necessary, but I think it does not need to be automatic... if it is, better!
The database consists of 900+ tables (legacy system)

http://www.windowsazure.com/en-us/manage/services/sql-databases/getting-started-w-sql-data-sync/
http://msdn.microsoft.com/en-us/library/hh456371.aspx
I think the best bet is to use SQL Data Sync. It should give you bidirectional sync; we currently use it to sync data between datacenters around the world and one local on-premises database. It will only give you a 5-minute sync interval, but this will probably do; otherwise the next best option is to use SQL Server VMs and do it the old-fashioned way. We have found SQL Azure Data Sync to be reasonably reliable and have been running it for a good six months, syncing across four databases in four data centres in Azure.
There are some problems with it, though:
It uses triggers.
It will obviously add load and connections to your current SQL database.
The new control panel in Azure is a nightmare for it, so I would use the old panel for the moment.
It was in preview last time I looked, so it might not be 100% suitable for you.
I would imagine there are some better third-party solutions out there, but as an off-the-shelf option within Azure, SQL Data Sync is well worth a look for the situation you are describing.
