When a database grows huge, how do you divide it and spread it across multiple servers?
How huge? Single-instance SQL Server deployments are capable of handling petabyte databases.
For scale-out, one option to look at is Peer-to-Peer Transactional Replication, which can scale out an application in place even if it wasn't explicitly designed for that.
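For illustration, a minimal publisher-side sketch of what enabling a peer-to-peer publication looks like in T-SQL (database, publication, and table names are placeholders; the exact required options vary by SQL Server version and edition, and distribution must already be configured on each node):

    -- Enable the database for publishing (run at each peer)
    EXEC sp_replicationdboption
        @dbname  = N'SalesDB',            -- placeholder database name
        @optname = N'publish',
        @value   = N'true';

    -- Create a transactional publication enabled for peer-to-peer
    EXEC sp_addpublication
        @publication                  = N'P2P_Sales',   -- placeholder name
        @enabled_for_p2p              = N'true',
        @allow_initialize_from_backup = N'true',
        @status                       = N'active';

    -- Add the table(s) to replicate; peer-to-peer articles cannot be filtered
    EXEC sp_addarticle
        @publication   = N'P2P_Sales',
        @article       = N'Orders',                     -- placeholder table
        @source_object = N'Orders';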
Applications that are designed for scale-out ahead of time have more options; for instance, consider how MySpace spans 1,000 individual databases by using a message bus.
For more specific answers, you have to provide more specific details about your real case.
Looking for suggestions on loading data from SQL Server into Elasticsearch or any other data store. The goal is to have transactional data available in real time for reporting.
We currently use a 3rd party tool, in addition to SSRS, for data analytics. The data transfer is done using daily batch jobs, and as a result there is a 24-hour data latency.
We are looking to build something out that would allow for more real time availability of the data, similar to SSRS, for our Clients to report on. We need to ensure that this does not have an impact on our SQL Server database.
My initial thought was to do a full dump of the data during the weekend, and then push writes over in near real time during weekdays.
Thanks.
ElasticSearch's main use cases are for providing search-type capabilities on top of large volumes of unstructured, text-based data. For example, if you were ingesting large batches of emails into your data store every day, ElasticSearch is a good tool to parse out pieces of those emails based on rules you set up with it, to enable searching (and to some degree querying) of those email messages.
If your data is already in SQL Server, it sounds like it's structured already, and therefore there's not much to be gained from ElasticSearch in terms of reportability and availability. Rather, you'd likely be introducing extra complexity into your data workflow.
If you have structured data in SQL Server already, and you are experiencing issues with reporting directly off of it, you should look at building a data warehouse instead to handle your reporting. SQL Server comes with a number of features out of the box to help you replicate your data for this very purpose. The three main features you could look into to accomplish this are AlwaysOn Availability Groups, Replication, and SSIS.
Each option above (in addition to other out-of-the-box features of SQL Server) has different pros and drawbacks. For example, AlwaysOn Availability Groups are very easy to set up and offer the ability to fail over automatically if your main server has an outage, but they clone the entire database to a replica. Replication lets you choose more granularly to copy only specific Tables and Views, but then you can't fail over as easily if your main server has an outage. So you should read up on all three options and understand their differences.
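As a rough illustration of the Availability Groups route, a secondary replica can be made readable so reports run against it instead of the primary. The sketch below uses made-up server, database, and endpoint names, and assumes the WSFC cluster, mirroring endpoints, and an initial full backup are already in place:

    -- Minimal two-replica AG sketch; the secondary accepts read-only connections
    CREATE AVAILABILITY GROUP ReportingAG
        FOR DATABASE SalesDB                            -- placeholder database
        REPLICA ON
            N'SQLNODE1' WITH (
                ENDPOINT_URL      = N'TCP://SQLNODE1.corp.local:5022',
                AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                FAILOVER_MODE     = AUTOMATIC,
                SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)),
            N'SQLNODE2' WITH (
                ENDPOINT_URL      = N'TCP://SQLNODE2.corp.local:5022',
                AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                FAILOVER_MODE     = AUTOMATIC,
                SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));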
Additionally, if you're having specific performance problems trying to report off of the main database, you may want to dig into the root cause of those problems first before looking into replicating your data as a solution for reporting (although it's a fairly common solution). You may find that a simple architectural change like using a columnstore index on the correct Table will improve your reporting capabilities immensely.
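For example (purely illustrative table and column names), a nonclustered columnstore index can be added on top of an existing rowstore table so large reporting scans read compressed column segments instead of the full rows. Note that prior to SQL Server 2016 the table becomes read-only while a nonclustered columnstore index exists on it:

    -- Hypothetical fact table used only to show the syntax
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesOrderDetail_Reporting
        ON dbo.SalesOrderDetail (OrderDate, ProductID, Quantity, UnitPrice);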
I've been down both pathways of implementing ElasticSearch and a data warehouse using all three of the main data synchronization features above, for structured data and for unstructured large text data, and have experienced the proper use cases for both. One data warehouse I've managed in the past had Tables with billions of rows (each Table terabytes in size), and it was highly performant for reporting on fairly modest hardware in AWS (we weren't even using Redshift).
We have a medium-sized web application (multiple instances), querying against a single SQL Server 2014 database.
Not the most robust architecture, no clustering/failover, and we have been getting a few deadlocks recently.
I'm looking at how I can improve the performance and availability of the database, reduce these deadlocks, and have a better backup/failover strategy.
I'm not a DBA, so looking for some advice here.
We currently have the following application architecture:
Multiple web servers reading and writing to a single SQL Server DB
Multiple background services reading and writing to the same single SQL Server DB
I'm contemplating making the following changes:
Split the single DB into two DBs, one read-only and the other read-write. The read-write DB replicates the data to the read-only DB using SQL Server replication
Web servers connect to the given DB depending on the operation.
Background services connect to the read-write DB (most of the writes happen here)
Most of the DB queries on the web servers are reads (and a lot of the writes can be offloaded to the background services), so that's the reason for my thoughts here.
I could then also potentially add clustering to the read-only databases.
Is this a good SQL Server database architecture? Or would the DBAs out there simply suggest a clustering approach?
Goals: performance, scalability, reliability
Without more specific details about your server, it's tough to give you specific advice (for example, what's a medium-sized web application? what are the specs on your database server? What's your I/O latency like? CPU contention? Memory utilization?)
At a high level of abstraction, deadlocks usually occur for one of two reasons:
Your reads are too slow, and
Your writes are too slow.
There are lots of ways to address both of those issues, but in general:
You can cover a lot of coding sins with good hardware, and
Don't re-architect a solution until you've pursued performance tuning options (including indexing strategies and/or procedure rewrites).
Clustering is generally considered to be used as a strategy for High Availability/Disaster Recovery, not performance augmentation (there are always exceptions).
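Before re-architecting, it's also worth looking at the actual deadlock graphs rather than guessing. On SQL Server 2012 and later the built-in system_health Extended Events session captures them, and they can be read back with something like this sketch:

    -- Pull xml_deadlock_report events from the system_health ring buffer
    SELECT
        xed.value('(@timestamp)[1]', 'datetime2') AS deadlock_time,
        xed.query('.')                            AS deadlock_graph
    FROM (
        SELECT CAST(st.target_data AS xml) AS target_data
        FROM sys.dm_xe_session_targets AS st
        JOIN sys.dm_xe_sessions AS s
            ON s.address = st.event_session_address
        WHERE s.name = N'system_health'
          AND st.target_name = N'ring_buffer'
    ) AS t
    CROSS APPLY t.target_data.nodes(
        'RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(xed);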
I have been tasked with getting a copy of our SQL Server 2005/2008 databases in the field on-line internally and update them daily. Connectivity with each site is regulated, so on-line access is not an option. Field databases are Workgroup licensed. Main server is Enterprise with some obscene number of processors and RAM. The purpose of the copies is two-fold: (1) on-line backup and (2) source for ETL to the data warehouse.
There are about 300 databases, identical schema for the most part, located throughout the US, Canada and Mexico. Current DB sizes range from 5 GB to over 1 TB. Activity varies, but is about 1,500,000 new rows daily on each server, mostly in 2 tables. About 50 tables total in each. Connection quality and bandwidth with each site varies, but the main site has enough bandwidth to do many sites in parallel.
I'm thinking SSIS, but am not sure how to approach this task other than table-by-table. Can anyone offer any guidance?
Honestly, I would recommend using SQL replication. We do this quite a bit, and it will even work over dialup. It basically minimizes traffic needed as only changes are transferred.
There are several topologies. We only use merge (two way), but transactional might be OK for your needs (one way).
Our environments are a single central DB, replicating (using filtered replication articles) to various site databases. The central DB is the publisher. It is robust, once in place, but is a nuisance for schema upgrades.
However, given your databases aren't homogeneous, it might be easier to set it up where the remote site is the publisher, and the central SQL instance has a per site database that is a subscriber to the site publisher. The articles wouldn't even need to be filtered. And then you can process the individual site data centrally.
Note that the site databases would need replication components installed (they are generally optional in the installer). To be set up as publishers they'd also need local configuration (distribution configured on each one). Being Workgroup edition, it can act as a publisher; SQL Express can't act as a publisher.
It sounds complicated, but it is really just procedural, and it's a built-in mechanism for doing this sort of thing.
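If you go the per-site-publisher route, the publisher side is scriptable; a rough sketch with placeholder names (distribution must already be configured on the site instance, and the Log Reader/Snapshot Agent jobs are omitted here):

    -- Run at the remote site (the publisher)
    EXEC sp_replicationdboption
        @dbname  = N'SiteDB',
        @optname = N'publish',
        @value   = N'true';

    EXEC sp_addpublication
        @publication = N'SitePub',
        @repl_freq   = N'continuous',     -- transactional publication
        @status      = N'active';

    EXEC sp_addarticle
        @publication   = N'SitePub',
        @article       = N'Readings',     -- placeholder table, no filter needed
        @source_object = N'Readings';

    -- Register the central server; a pull subscription keeps agent load off the
    -- site (a matching sp_addpullsubscription is then run on the central instance)
    EXEC sp_addsubscription
        @publication       = N'SitePub',
        @subscriber        = N'CENTRALSQL',
        @destination_db    = N'Site001_Copy',
        @subscription_type = N'pull';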
We have a decent-sized, write-heavy database that is about 426 GB (including indexes) and about 300 million rows. We currently collect location data from devices that report to our server every couple of minutes, and we serve about 10,000 devices - so lots of writes every second. The location table that stores the location of each device has about 223 million rows. The data is currently archived by year.
Problems occur when users run large reports on this database; the whole database grinds almost to a stop.
I understand I need a reporting database, but my question is whether anyone has experience using SQL Server Transactional Replication on a database of equivalent size, and what that experience has been like.
My rough plan is to point all the reports in our application to the Reporting Database, use Transactional Replication to replicate the data over from the master to the slave (Reporting Database).
Anyone have any thoughts on this strategy and the problems I may encounter?
Many thanks!
Transactional replication should work well in this scenario (the only effect the size of the database will have is the time taken to generate the initial snapshot). However, it may not solve your problem.
I think the issue you'll have if you choose transactional replication is that the slave server is going to be under the same load as the master machine as changes are applied - it will still crawl when users run large reports (assuming it's of a similar spec).
Depending on the acceptable latency of reporting data to the live data, this may or may not be OK for your users.
If some latency is acceptable you may get better performance from log shipping, since changes are applied in batches.
Before acquiring a reporting server, another approach would be to investigate the queries that your users are running and look at modifying either their code or the indexing strategy to better match what they're trying to do.
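For that kind of investigation, the plan cache DMVs are a reasonable starting point; a minimal sketch that lists the heaviest cached statements by logical reads:

    -- Top 20 cached statements by total logical reads
    SELECT TOP (20)
        qs.total_logical_reads,
        qs.execution_count,
        qs.total_elapsed_time / 1000 AS total_elapsed_ms,
        SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC;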
Transactional Replication could work well for you. The things to consider:
The target database tables must be read-only.
The server containing the target database should be stout enough to handle the SELECT traffic from the reporting applications.
Depending on the INSERT/UPDATE traffic, you may need to have a third server act as the Distribution server.
You also have to consider the size of the Distribution database.
Based on what I read here, I'd use a pull subscription from the Reporting server to offload traffic from the OLTP server.
You can skip the torment of a snapshot by initializing the reporting database from a backup of the OLTP database. See https://msdn.microsoft.com/en-us/library/ms151705.aspx
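Roughly, the publication is created with @allow_initialize_from_backup = N'true', the backup is restored on the reporting server, and the subscription is then created telling replication the data is already in place; a sketch with placeholder names:

    -- Run at the publisher after restoring the backup on the reporting server;
    -- the publication must have been created with
    -- @allow_initialize_from_backup = N'true'
    EXEC sp_addsubscription
        @publication      = N'ReportingPub',
        @subscriber       = N'REPORTSQL',
        @destination_db   = N'SalesDB_Reporting',
        @sync_type        = N'initialize with backup',
        @backupdevicetype = N'disk',
        @backupdevicename = N'\\BACKUPSRV\sql\SalesDB_full.bak';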
There will be INSERT/UPDATE/DELETE traffic from the Replication into both the Distribution and the Subscriber databases. That requires consideration, but lock/block issues should be no worse (and probably better) than running those reports off of OLTP.
I am running multiple publications on a 2.6TB database with 2.5GB/day of growth, using both pure transactional to drive reports (to two reporting servers) and Peer-to-Peer Transactional to replicate data in a scale-out for a SaaS offering (to three more servers). Because of this, we have a separate distributor.
Hope this helps.
Thanks
John.
For our application (a desktop app in .NET), we want to have 2 databases in 2 different remote places (different countries). Is it possible to use replication to keep the data in sync in both databases while the application changes data? What other strategies can be used? Should the sync happen instantaneously or at a scheduled time? What if we decide to keep one database 'readonly'?
thanks
You need to go back to your requirements I think.
Does data need to be shared between two sites?
Can both sites update the same data?
What's the minimum acceptable time for an update in one location to be visible in another?
Do you need failover/disaster recovery capability?
Do you actually need two databases? (e.g. is it for capacity, for failover, or simply because the network link between the two sites is slow? etc.)
Any other requirements around data access/visibility?
Real-time replication is one solution, an overnight extract-transform-load process could be another. It really depends on your requirements.
I think the readonly question is key. If one database is readonly then you can use mirroring to sync them, assuming you have a steady connection.
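A rough sketch of that setup (placeholder server and file names; endpoint creation and the initial RESTORE ... WITH NORECOVERY are omitted). Note the mirror copy itself stays in a restoring state, so reads against it go through a database snapshot:

    -- On the mirror server (after restoring the database WITH NORECOVERY)
    ALTER DATABASE SalesDB
        SET PARTNER = N'TCP://principal.corp.local:5022';

    -- On the principal server
    ALTER DATABASE SalesDB
        SET PARTNER = N'TCP://mirror.corp.local:5022';

    -- On the mirror server: a static, read-only point-in-time view of the mirror
    -- (NAME must match the source database's logical data file name)
    CREATE DATABASE SalesDB_ReadOnly
        ON (NAME = SalesDB_Data, FILENAME = N'D:\Snapshots\SalesDB_RO.ss')
        AS SNAPSHOT OF SalesDB;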
What is the bandwidth and reliability of connection between the sites?
If updates are happening at both locations (on the same data) then Merge Replication is a possibility. It's really designed for mobile apps where users in the field have some subset of the data and conflicts may need to be resolved at replication time.
High level explanation of the various replication types in SQL Server including the new Sync Framework in SQL Server 2008 can be found here: http://msdn.microsoft.com/en-us/library/ms151198.aspx
-Krip