Comparing MapReduce to cloud database services

Are the databases offered by cloud services such as Windows Azure SQL Database or AWS Big Data capable of distributed computing, in the sense that the query optimizer divides the work across servers which compute in parallel, similar to how MapReduce distributes computation across nodes?
I haven't found anything about any such query optimization in the Azure documentation, although PDW seems like it may do this.

AWS has EMR (Elastic MapReduce), which is Hadoop provisioned by AWS.
Azure has HDInsight, which is the Hortonworks Data Platform (Hadoop) installed on Windows VMs.
Microsoft's PDW (Parallel Data Warehouse) doesn't support MapReduce as far as I know, but they are working on it (http://www.zdnet.com/microsofts-polybase-mashes-up-sql-server-and-hadoop-7000007424/). PDW is essentially a few SQL Server machines with a central management layer that allows partitioning and distribution of the data between the different nodes. It can and will break a query across the PDW nodes if the data resides on more than one, but the parallelism is not MapReduce in nature.
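For context, the PolyBase feature described in that article later shipped in the APS appliance and SQL Server 2016. A rough sketch of what querying Hadoop data through it looks like follows; all object names, the HDFS address, and the file layout here are invented:

    -- Register the Hadoop cluster and a file format (illustrative values).
    CREATE EXTERNAL DATA SOURCE HadoopCluster
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

    CREATE EXTERNAL FILE FORMAT TextDelimited
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

    -- Expose the HDFS files as a table.
    CREATE EXTERNAL TABLE dbo.WebLogs
    (
        LogDate    DATETIME2,
        Url        NVARCHAR(400),
        DurationMs INT
    )
    WITH (LOCATION = '/data/weblogs/',
          DATA_SOURCE = HadoopCluster,
          FILE_FORMAT = TextDelimited);

    -- The external table can then be queried and joined to relational tables;
    -- the engine decides how much of the work to push to the Hadoop side.
    SELECT TOP (100) Url, COUNT(*) AS Hits
    FROM dbo.WebLogs
    GROUP BY Url
    ORDER BY Hits DESC;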

Related

AWS EC2 - SQL Server with multiple writers

I have a requirement where I need to see the possibility of having multiple read and write clusters for SQL Server deployed in Amazon EC2.
I understand that we can configure Always On high availability for SQL Server, which does support multiple read replicas.
My question is: is it possible to have multiple write replicas for a single database running in EC2?
We have multiple batch jobs running on a single database and there is a lot of load on it, affecting the performance of the application.
I am quite new to AWS and any help/pointers will be appreciated.
Regards,
Madhu

Advice on Azure platform to host Data Warehouse

I am a Data Warehouse developer currently looking into using the Azure platform to host a new Data Warehouse.
My experience is with on-premises servers hosting standard SQL Server databases, one for the staging database and one for the Data Warehouse. Typically I would use a combination of SSIS and stored procedures running in a scheduled SQL Server Agent job for the ETL.
How can I replicate this kind of setup within Azure?
The storage size will be less than 1 TB, so could I just use Azure SQL Database instead of Azure SQL Data Warehouse?
If so, would I need separate databases for staging and the data warehouse, using the elastic pool option?
The data that I will be loading into staging will all be on-premises. Will SSIS still be suitable for loading to Azure, or would Azure Data Factory be a better fit?
Any help at all would be greatly appreciated! Thanks.
Leon has lots of good information there, but from a Data Warehouse perspective I wouldn't use Data Sync for ETL purposes (it is listed as "not preferred" under "When to use Data Sync" in the link Leon provided).
For a DW, Azure SQL Database is a good option. Azure SQL Data Warehouse (known as Azure Synapse Analytics nowadays) is a heavy-duty beast for handling a DW. Are you really sure you need that kind of system for less than 1 TB of data? I'd personally leave Azure Synapse aside for now and try Azure SQL Database first. It's a LOT cheaper, and you can upgrade later if necessary.
One thing to note about Azure SQL Database, though: it doesn't support cross-database queries. That's not a deal breaker; everything can be handled in the same database. I personally use a schema to differentiate staging from the DW (and of course I use other schemas in the DW as well). Using separate databases isn't very difficult either, but the boundary between them is a lot harder in Azure SQL Database than in on-premises SQL Server or some other Azure options (Managed Instance, for example).
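A minimal sketch of the schema-based separation I mean, with made-up table and column names:

    -- Staging and warehouse tables live in the same Azure SQL database,
    -- separated only by schema.
    CREATE SCHEMA stg;
    GO
    CREATE SCHEMA dw;
    GO

    CREATE TABLE stg.Customer
    (
        CustomerId   INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    );

    CREATE TABLE dw.DimCustomer
    (
        CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId   INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    );

    -- The ETL reads from staging and loads the dimension
    -- without ever leaving the database.
    INSERT INTO dw.DimCustomer (CustomerId, CustomerName)
    SELECT s.CustomerId, s.CustomerName
    FROM stg.Customer AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dw.DimCustomer AS d
                      WHERE d.CustomerId = s.CustomerId);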
SSIS is still an option, but the question is what you use to run the packages. There are options like:
continue running them on-premises (all the hard work is still done in the cloud)
rent a VM with SQL Server from Azure, deploy the packages to the VM and run them from the VM
use Data Factory to run the SSIS packages
None of those is a perfect solution for every use case. The first two options come with quite a heavy cost if running SSIS is the only thing you need them for. Using Data Factory to run SSIS is a bit cumbersome at the moment, but it's an option anyway.
Data Factory itself is a good option as well (I haven't personally tried it, but I have heard good things about it). If you would use Data Factory only to run your SSIS packages, why not start using Data Factory without SSIS in the first place? Of course, Data Factory has some limitations compared to SSIS, which might be the reason, but if your SSIS packages are simple enough, why not give Data Factory a try.
I would suggest using Azure SQL Database. It provides many price tiers with different storage sizes, so you can select the most suitable one for you. Azure SQL Database also supports scaling up and down based on usage.
Ref: Service tiers in the DTU-based purchase model
And as you said, the data you will be loading into staging will all be on-premises.
Azure SQL Database has a feature, Data Sync, that can help you do that:
Data Sync is useful in cases where data needs to be kept updated across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync:
Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who are considering moving to the cloud and would like to put some of their application in Azure.
Distributed Applications: In many cases, it's beneficial to separate different workloads across different databases. For example, if you have a large production database, but you also need to run a reporting or analytics workload on this data, it's helpful to have a second database for this additional workload. This approach minimizes the performance impact on your production workload. You can use Data Sync to keep these two databases synchronized.
Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily keep databases in regions around the world synchronized.
When you create the SQL database, you can migrate the schema and data to Azure with tools such as the Data Migration Assistant (DMA).
Then set up SQL Data Sync between Azure SQL Database and SQL Server on-premises; it can sync the data automatically every 5 minutes.
Hope this helps.
If you want to start on the less expensive options in Azure, go with a general purpose SQL database and an Azure Data Factory pipeline with a few activities.
Dynamic Resource Scaling ETL
You can scale up the database by issuing an ALTER DATABASE statement and then move on to your stored-procedure-based ETL. I would even use a "master" proc to call the dimension and fact procs to control the execution flow. Then scale the database down with another ALTER DATABASE statement. I even created my own stored proc to issue these scaling statements.
You also cannot predict when the scaling will be completed, so I have a wait activity. You could be a little more nerdy with a loop that checks the service objective property and then proceeds when it is complete (see the sketch after the activity list below), but it was just easier to wait for 10 minutes. I have only been burnt a couple of times when the scaling took longer.
Data Pipeline Activities:
Scale up, proceed if successful
Wait about 10 minutes, proceed always
Execute the ETL, proceed always
Scale down
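A minimal sketch of the scaling statements and the service-objective check, assuming a database named MyDw and DTU service objectives (the names and tiers are illustrative):

    -- Scale up before the ETL run (S4 is just an example service objective).
    ALTER DATABASE MyDw MODIFY (SERVICE_OBJECTIVE = 'S4');

    -- The statement returns immediately and the change completes asynchronously;
    -- check the current objective to know when the scale has finished.
    SELECT service_objective
    FROM sys.database_service_objectives
    WHERE database_id = DB_ID('MyDw');

    -- ... run the stored-proc based ETL here ...

    -- Scale back down when the load has finished.
    ALTER DATABASE MyDw MODIFY (SERVICE_OBJECTIVE = 'S1');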
Elastic Query
You can query across databases with vertically partitioned Elastic Query. Performance isn't great, and Microsoft doesn't recommend it for ETL, but it will work. To improve performance, try dumping any large table you need into a temp table and then transforming the data locally.
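A minimal sketch of a vertically partitioned elastic query setup; the server, database, credential, and table names are all made up:

    -- Run in the database that issues the cross-database queries.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    CREATE DATABASE SCOPED CREDENTIAL StagingCred
    WITH IDENTITY = 'etl_user', SECRET = '<password>';

    CREATE EXTERNAL DATA SOURCE StagingDb
    WITH (TYPE = RDBMS,
          LOCATION = 'myserver.database.windows.net',
          DATABASE_NAME = 'Staging',
          CREDENTIAL = StagingCred);

    -- Mirrors the table definition in the remote Staging database.
    CREATE EXTERNAL TABLE dbo.StagingCustomer
    (
        CustomerId   INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    )
    WITH (DATA_SOURCE = StagingDb,
          SCHEMA_NAME = 'dbo',
          OBJECT_NAME = 'Customer');

    -- Pull the remote rows into a local temp table once,
    -- then do the transformations locally.
    SELECT CustomerId, CustomerName
    INTO #Customer
    FROM dbo.StagingCustomer;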

SQL Server high availability solutions in Amazon AWS

I am in the process of migrating to Amazon AWS and need a SQL server high availability solution. The current licence that I have is SQL standard 2016.
At this time Amazon does not support shared volumes for Windows instances, so I am not able to use a regular SQL failover cluster, the kind where, if the entire server goes down, the standby server picks up the slack and continues writing to the same storage. My only option is Always On Basic Availability Groups. As I am starting to get familiar with this feature, I find it very maintenance-intensive and can see it becoming a problem when dealing with thousands of databases. In my case I have about 5k databases, mostly small in size, 600 MB or less each. My question is: is Amazon not a viable hosting environment for a full SQL failover solution? Is one Always On Basic Availability Group per database a viable solution?

What's the best redundancy setup on AWS for SQL Server 2014

We're migrating our environment over to AWS from a colo facility. As part of that we are upgrading our two SQL Server 2005 instances to 2014. The two are currently mirrored, and we'd like to keep it that way or find other ways to make the servers redundant. The number of transactions/server use is light for our app, but it's in production, requires high availability, and, as a result, requires some kind of failover.
We have already set up one EC2 instance and put SQL Server 2014 on it (as opposed to using RDS, for licensing reasons) and are now exploring what to do next to achieve this.
What suggestions do people have to achieve the redundancy we need?
I've seen two options thus far from here and googling around. I list them below - we're very open to other options!
First, use the RDS mirroring service, but I can't tell if that only applies if the principal server is also on RDS; it also doesn't help with licensing.
Second, use multiple Availability Zones. What are the pros/cons of this versus using different Regions altogether (e.g., bandwidth issues)? And does multi-AZ actually give redundancy (if AWS goes down in Oregon, for example, then doesn't everything go down)?
Thanks for the help!
The Multi-AZ capability of Amazon RDS (Relational Database Service) is designed to offer high-availability for a database.
From Amazon RDS Multi-AZ Deployments:
When you provision a Multi-AZ DB Instance, Amazon RDS automatically creates a primary DB Instance and synchronously replicates the data to a standby instance in a different Availability Zone (AZ). Each AZ runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. In case of an infrastructure failure (for example, instance hardware failure, storage failure, or network disruption), Amazon RDS performs an automatic failover to the standby, so that you can resume database operations as soon as the failover is complete. Since the endpoint for your DB Instance remains the same after a failover, your application can resume database operation without the need for manual administrative intervention.
Multiple Availability Zones are recommended to improve availability of systems. Each AZ is a separate physical facility such that any disaster that should befall one AZ should not impact another AZ. This is normally considered sufficient redundancy rather than having to run across multiple Regions. It also has the benefit that data can be synchronously replicated between AZs due to low-latency connections, while this might not be possible between Regions since they are located further apart.
One final benefit: the Multi-AZ capability of Amazon RDS can be activated by simply selecting "Yes" when the database is launched. Running your own database and using mirroring services requires considerably more work on an ongoing basis.

Columnar Database Comparisons and DBA efforts

I'm trying to find a database solution, and I came across Infobright and Amazon Redshift as potential solutions. Both are columnar databases. Infobright has been around for quite some time, whereas Amazon Redshift is newer.
What is the DBA effort between Infobright and Amazon Redshift?
How accessible is Infobright (API, query interface, etc.) vs AWS?
Where do both sit in your system architecture? Do they operate as a layer on top of your traditional RDBMS?
What is the DevOps effort to set up Infobright and Redshift?
I'm leaning a bit more towards Redshift because my application is hosted on AWS, and I thought this would create tangible benefits in the long run since everything would be in AWS. Thank you in advance!
Firstly, I'll admit that I work for Infobright. I've done significant research into Redshift, and I feel I can give an honest opinion. I just wrote up a comparison between the two technologies; it can be found here: https://www.infobright.com/wp-content/plugins/download-monitor/download.php?id=37
DBA Effort - Infobright requires very little administration. You cannot index; you don't need to partition, etc. It's an SMP architecture and scales well, so you won't be dealing with multiple nodes. Redshift is also fairly simple, but you will need to maintain the table sort order and ensure ANALYZE is run often enough.
Infobright uses a MySQL shell, so any tool that can work with MySQL can work with Infobright; you have the same set of tools/interfaces/APIs for Infobright as you do for MySQL. Redshift also has a SQL interface and some API capabilities, but it requires that you load data directly from S3. Infobright loads from flat files and named pipes, from local or remote servers.
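For illustration, the Redshift side of this might look like the following; the table, bucket, and IAM role names are made up:

    -- Redshift loads data from S3 with COPY.
    COPY analytics.page_views
    FROM 's3://my-bucket/page_views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;

    -- Routine maintenance: keep the sort order and the planner statistics fresh.
    VACUUM analytics.page_views;
    ANALYZE analytics.page_views;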
Both databases are analytic databases. You would not want to use either as a transactional database. Instead, you typically push data from your transactional system to your analytic database.
The DevOps effort to set up Infobright will be lower than for Redshift. However, Redshift is not overly complicated either; maintenance of the environment is more of an ongoing requirement for Redshift, though.
Infobright does have many installations running on AWS. In fact, we have implementations that approach nearly 100 TB of raw storage on one server. That said, Redshift with many nodes can achieve petabyte scale.
There are other factors that can impact your choice. For example, Redshift has very nice failover/HA options already built in. On the flip side, Infobright can support many concurrent queries and users; Redshift limits concurrent queries to 15 regardless of cluster size.
Take a look at the document, and feel free to contact me if you have any specific questions about either technology.
