What is difference between 3 type of replication?
Is replication suitable for data archiving?
What is the replication steps?
Following are the three types of replication in SQL server.
Transactional replication
Merge replication
Snapshot replication
For more See http://technet.microsoft.com/en-us/library/ms152531.aspx
Replication can be used for archiving purposes as well but with some additional mechanisms. Most of the time I have seen, it is used in data warehousing scenarios to reduce load on the OLTP system.
See http://msdn.microsoft.com/en-us/library/ms151198.aspx
From MSDN:
Transactional replication is typically used in server-to-server scenarios that require high throughput, including: improving scalability and availability; data warehousing and reporting; integrating data from multiple sites; integrating heterogeneous data; and offloading batch processing. Merge replication is primarily designed for mobile applications or distributed server applications that have possible data conflicts. Common scenarios include: exchanging data with mobile users; consumer point of sale (POS) applications; and integration of data from multiple sites. Snapshot replication is used to provide the initial data set for transactional and merge replication; it can also be used when complete refreshes of data are appropriate. With these three types of replication, SQL Server provides a powerful and flexible system for synchronizing data across your enterprise.
There are lots of other articles on the Internet and lots of good books on SQL Server. It's a bit of a broad question to ask what it is AND how to implement it.
Related
Currently, I generate data on a different datastore and replicate to Snowflake Staging, then that data moves to the Data Warehouse DB through ELT ingestion for Analytics purpose. However this approach can be considered as creating data-silos in itself, since we already have 3 copies of the same data:
Transactional data-store DB
Replicated snowflake staging
Snowflake Data Warehouse DB
From a technical architecture point of view, is it a good idea to use Snowflake as a direct datastore for transactional application? (application that does many CRUD operations). That may help in avoiding the cost of replication and ingestion.
The main problem I see with this approach is that: Snowflake does not enforce any referential integrity (primary keys, foreign keys) so within the CRUD app, I have to either use a MERGE statement always or somehow make sure I don't create duplicate records.
The other problem being in the cloud, the distance (aka network) between the app and snowflake decides the performance of the transactions, I want good, consistent performance of my CRUD operations.
Any thoughts/suggestions are much appreciated.
Snowflake as of today does not perform well with singleton updates and inserts, which is what we see mostly with transactional databases. I have seen a performance degradation when using singleton inserts are submitted against Snowflake.
On the contrary, they are very optimized for bulk ingestion of unstructured data and structured data though and are designed for OLAP warehouses. You can still use it but you may see the same performance degradation. Also, primary keys can be defined but they are not enforced.
In my opinion, if you are faced with that challenge, you have the option to use a Postgre SQL DB (open source) in the cloud as your transactional database and it acts as a good complement to Snowflake as the OLAP database.
No. Snowflake isn't good as a transactional / OLTP database for the reasons you've mentioned. Plus, it won't perform well with many individual CRUD operations due to how they structure the data (optimised for OLAP workloads).
Just want to point out that there are benefits to creating separate databases, for one you want to isolate your transactional database from that of your analytics database otherwise you could be significantly affect the performance of the application. Secondly, the data in the transactional database could change and if you had to reprocess the data for whatever reason you may not be able to do so. There are many more, but I will stop here :-)
I am Asp.Net MVC/SQLSERVER developer and I am very new to all these and so I may be on compelete wrong path.
I came to know by googling that Snowwflake can put/get data from AWS-S3, Google Storage and Azure. And Snowflake has their database and tables as well.
I have following questions,
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
Can we use Snowflake as database for data driven web application? and if yes, could you provide link or something to start?
Once again I am very new to all these and expecting from you to get ideas and best way to work arround this.
Thak you in advance.
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform, similar to other database technologies it provides data storage and metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example Snowflake doesn't support features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said, to add to that, from my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil those companies like databricks and snowflake try to sell you MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms because the longer a query runs the more money they make. For them to make money they have to optimize performance but not too much otherwise it will effect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing Referential integrity is a double edged sword, the downside being as the data volume grows the referential violation check significantly slows down the inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off the RI enforcement by the database, finally ending up with a Snowflake like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.
I am currently examining different NoSQL and RDBMSes regarding their replication abilities in order to build distributed systems.
Reading through several papers and books, I get the feeling, that some vendors or authors use their own definitions regarding the terms
Master-Master Replication (Replication between two servers)
Master-Slave Replication (Replication between mutliple Servers in order to increase reading speed, writes are only able for the master server)
Multi-Master Replication (= Peer-To-Peer?)
Peer-To-Peer Replication (replication between n nodes, each can read/write)
Merge Replication (?)
E.g: Some mix up the terms Master-Master and Peer-to-Peer as the same, while in Mysql docus for instance I found it is differentiated between Master-Master and Multi-Master (=Peer-to-peer???) Replication.
Where is the difference in Multi-Master and Peer-to-Peer replication?
Is Multi-Master replication's use case more oriented towards Clustering while Peer-To-Peer targets distributed content to distributed applications?
I would like to sort things out and be sure that I have the right understanding in these terms, so maybe a discussion in here would help to merge some knowledge.
Regards, Chris
Edit: added merge replication to the list and some explanations as I understand them...
Regarding CouchDB, the story is simple. Here it is:
There is only one replication mode for CouchDB. The source copies all its data to the target, subject to an optional yes/no filter. I described CouchDB replication in another question. The key point is that "replication" is simply a DB client. It connects to both couches, reads from the source, and writes to the target.
Any other big-picture architecture (peer-to-peer, multi-master, master-slave) is just the implementation of the developers or the system administrators. For example, if GETs are distributed to many couches, but POST go to one central couch which replicates to the others, that is effectively master-slave. If you put a CouchDB in every major city for performance, and they replicate directly with each other, that is multi-master replication.
Within the CouchDB community, and especially from Chris Anderson's projects and presentations, "peer-to-peer" replication is a concept where CouchDB is everywhere: mobile phones, data centers, telephone poles. And replication happens directly between couches in a decentralized way, without a central authority or architecture, like the web itself.
When the database is becoming huge, how to divide it and span to multiple servers?
How huge? Single instance SQL Server deployments are capable of handling peta-byte databases.
For scale-out one option to look at is Peer-to-Peer Transactional Replication, which can do an in-place scale-out of an application not explicitly designed for such.
Applications that are designed for scale-out ahead of time have more options, for instance consider how MySpace spans over individual 1000 databases by using a message buss.
For more specific answers, you have to provide more specific details about your real case.
We have a decent sized, write-heavy database that is about 426 GB (including indexes) and about 300 million rows . We currently collect location data from devices that report to our server every couple of minutes, and we serve about 10,000 devices - so lots of writes every second. The location table that stores the location of each device has about 223 million rows. The data is currently archived by year.
Problems occur when users run large reports on this database, the whole database grinds down almost to a stop.
I understand I need a reporting database, but my question is if anyone has experience of using SQL Server Transactional Replication on a database of equivalent size, and their experience of using this technology?
My rough plan is to point all the reports in our application to the Reporting Database, use Transactional Replication to replicate the data over from the master to the slave (Reporting Database).
Anyone have any thoughts on this strategy and the problems I may encounter?
Many thanks!
Transactional replication should work well in this scenario (the only effect the size of the database will have is the time taken to generate the initial snapshot). However, it may not solve your problem.
I think the issue you'll have if you choose transactional replication is that the slave server is going to be under the same load as the master machine as changes are applied - it will still crawl when users run large reports (assuming it's of a similar spec).
Depending on the acceptable latency of reporting data to the live data, this may or may not be OK for your users.
If some latency is acceptable you may get better performance from log shipping, since changes are applied in batches.
Before acquiring a reporting server, another approach would be to investigate the queries that your users are running and look at modifying either their code or the indexing strategy to better match what they're trying to do.
Transactional Replication could work well for you. The things to consider:
The target database tables must be read-only.
The server containing the target database should be stout enough to handle the SELECT traffic from the reporting applications.
Depending on the INSERT/UPDATE traffic, you may need to have a third server act as the Distribution server.
You also have to consider the size of the Distribution database.
Based on what I read here, I'd use a pull subscription from the Reporting server to offload traffic from the OLTP server.
You can skip the torment of a snapshot by initializing the reporting database from a backup of the OLTP database. See https://msdn.microsoft.com/en-us/library/ms151705.aspx
There will be INSERT/UPDATE/DELETE traffic from the Replication into both the Distribution and the Subscriber databases. That requires consideration, but lock/block issues should be no worse (and probably better) than running those reports off of OLTP.
I am running multiple publications on a 2.6TB database with 2.5GB/day of growth, using both pure transactional to drive reports (to two reporting servers) and Peer-to-Peer Transactional to replicate data in a scale-out for a SaaS offering (to three more servers). Because of this, we have a separate distributor.
Hope this helps.
Thanks
John.