Improving availability for Azure SQL? - sql-server

What's a good strategy to maintain availability with Azure SQL? We're noticing way too many service interruptions, which basically kill our application entirely. The SLA is far from 99.9%, and honestly we're not interested in getting a refund, just reliable availability so our customers don't experience app outages. We're actually getting better uptime with a single IaaS VM running SQL Server than with Azure SQL (which totally blows our minds).
Anyway, leaving all those observations behind: programmatically, what is the most economical approach to getting better than the advertised 99.99% availability (say an order of magnitude better: 99.999%, i.e. about 5 minutes of downtime per year) with Azure SQL? Are there any specific data access programming patterns or operating procedures anyone recommends?
EDIT: We're already using the Microsoft EntLib 6.0 Transient Fault Handling Application Block library ... 10 retries with 100ms inter-attempt timings. However, it's not 'transient' when these outages are 5+ hours long ...
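For reference, the pattern we have in place is roughly equivalent to the sketch below (a rough Python rendering of retry-with-backoff around the connection; the error numbers, retry count and delay are placeholders matching our current settings, not a recommendation):

```python
# Rough Python equivalent of the retry pattern we already use via the
# Transient Fault Handling block: retry transient errors with a short delay.
# Error numbers, retry count and delay are placeholders.
import time
import pyodbc

TRANSIENT_ERRORS = ("40613", "40501", "40197")  # documented Azure SQL transient error numbers

def execute_with_retry(conn_str, sql, params=(), retries=10, delay=0.1):
    """Run a query, retrying transient failures; other errors surface immediately."""
    for attempt in range(1, retries + 1):
        try:
            with pyodbc.connect(conn_str, timeout=30) as conn:
                return conn.execute(sql, params).fetchall()
        except pyodbc.Error as exc:
            transient = any(code in str(exc) for code in TRANSIENT_ERRORS)
            if attempt == retries or not transient:
                raise  # out of retries, or not a transient fault
            time.sleep(delay)  # brief pause before the next attempt
```

But as noted, that only helps with genuinely transient faults; a 5+ hour outage needs a second copy of the database to fail over to.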

Have a look at the passive and active geo-replication features of SQL Database -
http://azure.microsoft.com/blog/2014/07/12/spotlight-on-sql-database-active-geo-replication/
Active geo-replication seamlessly creates additional copies of the database in another data centre, which you can fall back to in case of an issue in a particular data centre.
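On the application side, a common complement is to keep the geo-secondary's server name in configuration and fall back to it when the primary is unreachable. A rough sketch in Python (server names, driver and credentials are placeholders; whether you can connect to the secondary read-only or only after initiating a failover depends on how you configure geo-replication):

```python
# Minimal connection-failover sketch: try the primary Azure SQL server first,
# then fall back to the geo-replicated secondary. All names are placeholders.
import pyodbc

SERVERS = [
    "myapp-primary.database.windows.net",    # primary region
    "myapp-secondary.database.windows.net",  # geo-replication secondary
]

def connect_with_fallback(database, user, password, timeout=15):
    last_error = None
    for server in SERVERS:
        conn_str = (
            "DRIVER={ODBC Driver 18 for SQL Server};"
            f"SERVER={server};DATABASE={database};UID={user};PWD={password}"
        )
        try:
            return pyodbc.connect(conn_str, timeout=timeout)
        except pyodbc.Error as exc:
            last_error = exc  # server unreachable; try the next one
    raise last_error
```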

We use stovepiped solutions in each datacenter with an external load-balancer/failover solution. We expose URLs that are monitored to detect data center faults and trigger failover to another region.
In other words, we built our own business continuity solution to guarantee 5x9s.

Related

Can Snowflake be used to mitigate application failure for business continuity?

I would like your opinions or experiences around the following possible solution idea. I know Snowflake is primarily a data analytics platform. But why could we not use it for some creative scenarios like business continuity?
Problem
Imagine an application that supports a critical business process. There is a risk that the application could become unavailable for an extended period. The application in this case is a SaaS solution by a reputable vendor, Salesforce. So it does not go down often. And when it does, they normally restore it in less than a day. But the business process is a critical medical logistics process - meaning if a transaction is delayed for a few days, lives may be lost.
Background
Our transaction volumes are moderate. We probably serve 25 new patients per day, with a few hundred interactions each day to support those. In the event of an outage, a subset of those might need immediate manual intervention to keep things moving. Others might be able to wait a couple of days.
We already use Snowflake to store replicas of the application's data. We use Looker to write analytics reports.
Proposed Solution
Write reports that expose critical data that may be needed if the primary application fails. Then, when the primary application fails, users can view reports using the latest replicated data to enable manual activities to keep things going until the primary application is restored to working order.
If data changes are needed, they must be written down somewhere and then applied to the application when its availability is restored.
Your only issue could be latency: as it stands today, Snowflake is built for OLAP workloads, not OLTP workloads.
If the latency you get when running queries from Snowflake is acceptable, then you have a valid use case.
Snowflake is already used as an application backend, particularly where queries are about historical analysis and a latency of a few seconds (as opposed to immediate) is acceptable.
See: https://www.snowflake.com/workloads/data-applications/
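To make that concrete, the fallback "report" can be as simple as a query against the replicated tables, whether it runs in Looker or in a small script. A rough sketch using the Snowflake Python connector; the account details and all table and column names are hypothetical:

```python
# Hypothetical fallback query against the Snowflake replica of the application data.
# Credentials, database, and all table/column names are made up for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="bc_reader", password="...",
    warehouse="REPORTING_WH", database="SFDC_REPLICA", schema="PUBLIC",
)

cur = conn.cursor()
cur.execute("""
    SELECT patient_id, shipment_status, next_action, last_updated
    FROM patient_shipments
    WHERE shipment_status <> 'DELIVERED'
    ORDER BY last_updated DESC
""")
for row in cur.fetchall():
    print(row)  # in practice this feeds a report or worklist for manual handling
```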

Continuous database sync with Azure Functions

I am developing an app with Xamarin, with Azure serverless functions as the backend for the app.
I will be syncing map coordinates from the users in real time with a database in the cloud, i.e. taking coordinates from all users and then pushing the updated coordinates back to all users continuously, so that everyone can see each other's live location.
So I have to call an Azure Function in a continuous loop to keep the database in sync, so it can check the db every 4-5 seconds. Is this the best way to do it, or will this cause too many function executions and become costly? If there is a better way to sync the db, please suggest. Thank you.
You have a mobile app that is making HTTP calls to Azure Functions. Functions is elastic, and scale will likely be fine. As I understand it, you're not asking how to implement the server side of this; the real question here is pricing, right?
Azure Functions can run in two ways:
"serverless", aka "Consumption plan". In this case, Azure Functions is managing the underlying server (and scale out) and you pay only for active usage (per GB*Sec). This is what you get by default when you visit http://functions.Azure.com. See pricing details here: https://azure.microsoft.com/en-us/pricing/details/functions/
"AppService" - In this case, you've bought a VM upfront and you decided how much to scale. You pay a fixed monthly cost. See pricing details here: https://azure.microsoft.com/en-us/pricing/details/app-service/
You can switch between them. I'd recommend starting with the first approach. It will certainly be cheaper in the beginning when you have low traffic. Monitor cost, run your scenarios through the pricing sheets, and consider switching to the 2nd if it ends up being cheaper.
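As for the loop itself, a timer-triggered function is the usual way to run a periodic sync on either plan, rather than having the app call a function every few seconds. A rough sketch using the Azure Functions Python programming model; the 5-second NCRONTAB schedule and the sync helper are placeholders, and each run still counts as an execution on the Consumption plan:

```python
# Sketch of a timer-triggered Azure Function (Python v2 programming model) that
# runs every 5 seconds and pushes the latest coordinates out. The schedule and
# the sync_coordinates() helper are placeholders for illustration.
import logging
import azure.functions as func

app = func.FunctionApp()

@app.timer_trigger(schedule="*/5 * * * * *", arg_name="timer", run_on_startup=False)
def coordinate_sync(timer: func.TimerRequest) -> None:
    if timer.past_due:
        logging.warning("Timer is running late")
    sync_coordinates()

def sync_coordinates() -> None:
    # Placeholder for the real work: query the database for coordinates updated
    # since the last run and publish them to the connected clients.
    pass
```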

How often do Virtual Machines migrate in Google Cloud?

I just read an interesting article about Google Transparent Maintenance and now I am thinking about how often a virtual machine migrates per day.
I didn't find any statistics so I am wondering if anybody can give me some information about this?
Another point that comes to mind is doing live migration not only for maintenance purposes but also to save energy, by migrating virtual machines and powering off physical hosts. I found many papers discussing this topic, so I think it's a big topic in current research.
Does somebody know if there is any cloud provider already performing something like this?
How often do scheduled infrastructure maintenance events happen?
Infrastructure maintenance events don't have a set interval between occurrences, but generally happen once every couple of months.
How do I know if an instance will be undergoing an infrastructure maintenance event?
Shortly before a maintenance event, Compute Engine changes a special attribute in a virtual machine's metadata server before any attempts to live migrate or terminate and restart the virtual machine as part of a pending infrastructure maintenance event. The maintenance-event attribute will be updated before and after an event, allowing you to detect when these events are imminent. You can use this information to help automate any scripts or commands you want to run before and/or after a maintenance event. For more information, see the Transparent maintenance notice documentation.
SOURCE: Compute Engine Frequently Asked Questions.
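If you want to react to these events yourself, you can watch that attribute from inside the VM via the metadata server. A rough Python sketch; the metadata URL and header are as documented by Google, while the handler is a placeholder:

```python
# Sketch: watch the maintenance-event attribute from inside a Compute Engine VM.
# The metadata URL and "Metadata-Flavor" header are documented; the handler is
# a placeholder for whatever you want to do (drain work, checkpoint, etc.).
import requests

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event?wait_for_change=true"
)

def watch_maintenance_events():
    while True:
        try:
            resp = requests.get(METADATA_URL,
                                headers={"Metadata-Flavor": "Google"},
                                timeout=300)
        except requests.exceptions.Timeout:
            continue  # no change within the timeout window; poll again
        resp.raise_for_status()
        value = resp.text.strip()
        if value != "NONE":
            handle_pending_maintenance(value)

def handle_pending_maintenance(event):
    print(f"Maintenance event pending: {event}")  # placeholder handler
```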

Central data management for custom desktop applications

I have a background in web programming where both the data and the code live on the server. Web hosts with MySQL or the like are plentiful and cheap, so using the application from multiple PCs was never a problem.
However, I'm considering switching to building desktop applications, and the one factor that annoys me is syncing data across the many PCs I use. I was thinking of perhaps setting up a light Amazon EC2 instance with PostgreSQL on it and having my desktop applications use that.
I have a few questions:
I'm curious as to what latency I might expect by running the database on ec2 instead of the local network, any experience or insight is appreciated.
Are there better/more obvious/cheaper solutions?
I've looked at the pricing and it seems to come down to $24.48 per month on a yearly contract. Whilst not really expensive, it is not exactly cheap either. At what point does it become more interesting to run a local server?
I'm obviously not using my applications for large parts of the day (sleep, work,...). I was wondering if I can have the amazon server go into a sort of "sleep" mode and wake up when poked. An initial delay for the first desktop application is acceptable. The reason behind this behavior would be to save money on the instance if it is only actually needed for 10% of the day.
I welcome any feedback at all on how this problem is best tackled.
This could get ugly. Every single query you do will have latency associated with it. If you have a lot of queries, this can add up very fast. So keep your query count low, and try to pre-fetch and cache data when possible.
Not enough information to answer that question.
Depends on the cost of your local server. Keep in mind that you will need to pay for electricity to keep it on.
You can stop your instance when you are not using it; except for high-utilization reservations, you won't get billed while it is in the stopped state. With high-utilization reservations you will still pay the full cost.
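In other words, "sleep mode" means stopping and starting the instance yourself, whether manually, on a schedule, or from a small wake-up script. A minimal boto3 sketch; the instance ID and region are placeholders, and note that a stopped instance still incurs EBS storage charges and may get a new public IP on restart unless you attach an Elastic IP:

```python
# Minimal sketch of stopping/starting an EC2 instance on demand with boto3.
# The instance ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
INSTANCE_ID = "i-0123456789abcdef0"

def sleep_db_server():
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])   # compute billing stops

def wake_db_server():
    ec2.start_instances(InstanceIds=[INSTANCE_ID])  # boot the instance again
    waiter = ec2.get_waiter("instance_running")
    waiter.wait(InstanceIds=[INSTANCE_ID])          # block until it's running
```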

Reducing impact on server load caused by long but non-priority adhoc queries

I have a couple of applications which run long queries against an OLTP database. However, they have a significant impact on database server load.
Is it possible to run them with low priority? I still intend to allow the user to make ad hoc queries, but response time is not critical.
Please advise on solutions for Oracle and/or SQL Server.
If you're using 11g, then perhaps the Database Resource Manager will help you out. The resource manager allows you to change consumer groups based on I/O consumption, something that was unavailable in prior releases. If not, the best you can do is lower priority based on CPU use.
Place resource limits on their accounts via profiles. Here is a link:
http://psoug.org/reference/profiles.html
For this you can use Oracle Resource Manager.
The most important thing here is to have an idea of how Resource Manager can pick out which sessions to throttle. There are lots of criteria for assigning a user to a resource consumer group; often the username is used, but it can also be a few other things such as machine, module, etc. See Creating Consumer Group Mapping Rules (http://download.oracle.com/docs/cd/B28359_01/server.111/b28310/dbrm004.htm#CHDEDAIB).
Specifying Automatic Switching by Setting Resource Limits (http://download.oracle.com/docs/cd/B28359_01/server.111/b28310/dbrm004.htm#CHDDCGGG) could be very useful for you, since all users start in the same OLTP group. Some then start long-running ad hoc queries, and you want those sessions to switch to a lower-priority group for the duration of that call, as sketched below.
There could be one little snag: if a throttled session holds locks, those locks will be held longer and might cause problems elsewhere.
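Here is a rough sketch of that kind of automatic switching, run from python-oracledb as a privileged user. The plan and group names and the thresholds are illustrative only; your users also need the privilege to switch groups (DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP) and the plan must be active:

```python
# Rough sketch: add a directive that demotes long-running calls in an existing
# OLTP consumer group to a low-priority group after 60 seconds. Plan and group
# names and thresholds are illustrative, not a recommendation.
import oracledb

conn = oracledb.connect(user="system", password="...", dsn="dbhost/orcl")
with conn.cursor() as cur:
    cur.execute("""
        BEGIN
          DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
          DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
            consumer_group => 'LOW_PRIORITY_ADHOC',
            comment        => 'throttled ad hoc queries');
          DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
            plan             => 'OLTP_PLAN',           -- existing plan
            group_or_subplan => 'OLTP_GROUP',          -- group users start in
            comment          => 'demote long calls',
            switch_group     => 'LOW_PRIORITY_ADHOC',  -- where they get moved
            switch_time      => 60,                    -- seconds before switching
            switch_for_call  => TRUE);                 -- only that call is demoted
          DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
            plan             => 'OLTP_PLAN',
            group_or_subplan => 'LOW_PRIORITY_ADHOC',
            comment          => 'low priority',
            mgmt_p1          => 5);                    -- small share of CPU
          DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
          DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
        END;""")
```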
