Is one node reliable in Cassandra? - database

I'm new to Cassandra. Every tutorial I have read says that a Cassandra architecture uses several nodes so that if one has a problem, the others can take over.
Does using only one node put us at risk of data loss?
I have mostly worked with relational databases. Using one node is not a problem in an RDBMS (the service may become unavailable for some reason, but the data is still stored).
My project does not require high availability. I only have a very large data set and a heavy write load, which is why I chose Cassandra, but I want to use it with only one node.
Is this a problem for me? Is my data compromised?

Using a single server can result in data loss, even with a relational database. You might have regular backups, but the commit log / redo log is stored on the server's disk until it is archived off the server. If that disk fails, the log is lost, and with it all data written since the last backup held off the server.
Using multiple servers (a relational mirror or a distributed NoSQL store such as Cassandra) provides extra resilience and reduces the chance of data loss, since there are 2 or more copies.
In a 3-node Cassandra cluster with a replication factor of 3, matching the data loss of the single-server scenario would require all 3 servers to go down at once and all of them to lose their disks. (This assumes local disks; if the nodes share a SAN, you lose this benefit.) That lowers the chance of data loss considerably.
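As a rough illustration of that last point, here is a minimal sketch in Python. It assumes disk failures on different servers are independent and uses a made-up per-server failure probability; neither assumption comes from Cassandra itself, and shared storage such as a SAN breaks the independence.

```python
def loss_probability(p_disk_failure: float, replicas: int) -> float:
    """Chance that every replica loses its disk in the same window.

    Assumes failures are independent, which only holds for separate local disks.
    """
    return p_disk_failure ** replicas


# Hypothetical chance that one server loses its disk in a given window.
p = 0.02

single = loss_probability(p, 1)  # one node holds the only copy
three = loss_probability(p, 3)   # 3 nodes, replication factor 3

print(f"1 node:         {single:.6f}")
print(f"3 nodes, RF=3:  {three:.6f}")
```

With independent failures the risk shrinks geometrically with each extra copy, which is why a single node is the weakest configuration.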

Related

Are CoW snapshots the solution to safely pull data from critical OLTP databases for reporting?

Our IT team copies data from mission-critical SQL Server OLTP databases in what seems to be a naive way - basically just INSERT INTO ... SELECT * every night. We use this copied data database for reporting. This is unsatisfactory for various reasons but we're told it is the only way because uncontrolled user query execution could compromise OLTP performance & data integrity. I want an improvement that still addresses their concerns.
Copy-on-write snapshots are the best solution I've read about (we don't need up-to-the-minute data for reporting), but please comment on the following:
The snapshot's sparse files should be placed on a separate physical drive (so that snapshot reads/writes can occur without limiting disk throughput for OLTP tasks).
There should be a single NTFS filesystem spanning all physical disks (on a hunch that would work better than putting the online database and its snapshots on logically separated volumes).
Create the filesystem with the /L:enable flag (so it works better with large sparse files).
Avoid multiple snapshots (since original data would have to be copied to each one).
We could use a single snapshot MyDB_LatestSnapshot that could be deleted and very quickly re-created every day, or even throughout the day (so long as kicking users running reports off it is acceptable).
Since the database snapshots will always be recent, most data will not have changed and so it will still have to be retrieved from the same drive as the online OLTP database, so increased resource (CPU/RAM) use is inevitable. Won't a long-running reporting query that pulls years of historical data (including data that hasn't changed and therefore doesn't exist in the snapshot) block writes just as if it were running against the online database?
Is there any way to tell SQL Server to prioritize resources for the needs of the OLTP database?
I've found examples of how original rows are copied from the online database when they're updated, but how do snapshots handle structural changes in the new database, like new/altered tables, indexes, etc.?
Can snapshots have different user permissions versus the online database (so that users can read from the snapshot, but not the online database)?
The OLTP system runs core banking applications, so I understand utmost caution is justified, but I can't believe the current approach is best practice in 2022.
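To check that I have the mechanism right, here is a toy model of copy-on-write in Python. It is only a sketch of my understanding, not SQL Server's implementation: the class names and the dictionary of "pages" are invented for illustration.

```python
class Snapshot:
    """Sparse store holding only pages that changed after the snapshot was taken."""

    def __init__(self, source):
        self.source = source
        self.known = set(source.pages)  # pages that existed at snapshot time
        self.sparse = {}                # original versions of modified pages

    def preserve(self, page_id):
        # Copy the original page once, just before its first modification.
        if page_id in self.known and page_id not in self.sparse:
            self.sparse[page_id] = self.source.pages[page_id]

    def read(self, page_id):
        if page_id not in self.known:
            raise KeyError(page_id)
        if page_id in self.sparse:
            return self.sparse[page_id]
        # Unchanged pages are still served from the online database.
        return self.source.pages[page_id]


class Database:
    def __init__(self, pages):
        self.pages = dict(pages)
        self.snapshots = []

    def create_snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, page_id, value):
        # Every existing snapshot needs its own copy of the original page.
        for snap in self.snapshots:
            snap.preserve(page_id)
        self.pages[page_id] = value


db = Database({1: "a", 2: "b"})
snap = db.create_snapshot()
db.write(1, "a2")
print(snap.read(1), snap.read(2), len(snap.sparse))
```

In this model a read of page 2 through the snapshot still touches the online database's storage, which is the contention I am asking about, and each additional snapshot adds one more copy per modified page.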

Minimal configuration for VoltDB to be able to show durability and HA

My exposure to NoSQL or NewSQL/NeoSQL database servers is extremely limited, only theoretical. I've worked with traditional RDBMSs (like MySQL, Postgres) and directory-server (OpenLDAP), with and without replication.
My application stack is based on JBoss, and I've been tasked with setting up a minimal demo (with our application) that can demonstrate durability and high availability of data in VoltDB. Performance testing is not an objective at all.
I have been going through the VoltDB Planning Guide, but I am confused about whether "+1" or "x2" applies to the number of servers (or VoltDB instances) required, especially given these two statements:
The easiest way to size hardware for a K-Safe cluster is to size the initial instance of the database, based on projected throughput and capacity, then multiply the number of servers by the number of replicas you desire (that is, the K-Safety value plus one).
Rule of Thumb
When using K-Safety, configure the number of cluster nodes as a whole multiple of the number of copies of the database (that is, K+1)
Questions:
Now, let's say that I need 1 server given capacity/throughput requirements. So, to be able to have durability and high availability, do I need 2, 3 or 4 servers?
OTOH, using just 1 server, which key features of VoltDB would I have to forgo?
Is there any relationship (or conflict) between VoltDB's full disk persistence and snapshots? Say, does the availability of disk persistence remove the need for snapshots?
If you use 2 servers, you can keep a synchronous replica of data to protect from data loss, much like a RAID1 hard drive. Your data is double-safe, but there is a catch with availability. With only two servers, it's impossible to differentiate a network split from a failed node. In some cases, VoltDB will shut down a live node when another fails to ensure there will be no split brain. With 3 nodes, this won't be an issue and the cluster will remain available after any single node failure (with k=1 or k=2).
With just 1 server, all you lose is the multiple copies of data on multiple servers and the high-availability features that allow VoltDB to continue running after a node failure. You still have all of the other VoltDB features, including full disk persistence.
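The sizing rule quoted in the question is plain arithmetic, and a small Python sketch makes the "+1" versus "x2" confusion easier to see. The function names here are made up for illustration and are not part of any VoltDB tool.

```python
def cluster_size(base_servers: int, k_safety: int) -> int:
    """Servers needed: the base sizing multiplied by the number of copies (K + 1)."""
    return base_servers * (k_safety + 1)


def follows_rule_of_thumb(nodes: int, k_safety: int) -> bool:
    """True when the node count is a whole multiple of the number of copies."""
    return nodes % (k_safety + 1) == 0


# One server's worth of capacity, at increasing K-safety values.
sizes = {k: cluster_size(1, k) for k in (0, 1, 2)}
print(sizes)
```

So "+1" describes the number of copies (K+1) and the multiplication is applied to the base server count; with K=1 the two happen to coincide as "x2".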

How does Replication work in a Distributed Database

I would like to know how replication works in a distributed database. It would be nice if this could be explained in a thorough, yet easy to understand way.
It would also be nice if you could make a comparison between distributed transactions and distributed replication.
Single point of failure
The database server is a central part of an enterprise system, and, if it goes down, service availability might get compromised.
If the database server is running on a single server, then we have a single point of failure. Any hardware issue (e.g., disk drive failure) or software malfunction (e.g., driver problems, malfunctioning updates) will render the system unavailable.
Limited resources
If there is a single database server node, then vertical scaling is the only option when it comes to accommodating a higher traffic load. Vertical scaling, or scaling up, means buying more powerful hardware, which provides more resources (e.g., CPU, Memory, I/O) to serve the incoming client transactions.
Up to a certain hardware configuration, vertical scaling can be a viable and simple solution to scale a database system. The problem is that the price-performance ratio is not linear, so after a certain threshold, you get diminishing returns from vertical scaling.
Another problem with vertical scaling is that, in order to upgrade the server, the database service needs to be stopped. So, during the hardware upgrade, the application will not be available, which can impact underlying business operations.
Database Replication
To overcome the aforementioned issues associated with having a single database server node, we can set up multiple database server nodes. The more nodes, the more resources we will have to process incoming traffic.
Also, if a database server node is down, the system can still process requests as long as there are spare database nodes to connect to. For this reason, upgrading the hardware or software of a given database server node can be done without affecting the overall system availability.
The challenge of having multiple nodes is data consistency. If all nodes are in-sync at any given time, the system is Linearizable, which is the strongest guarantee when it comes to data consistency across multiple registers.
The process of synchronizing data across all database nodes is called replication, and there are multiple strategies that we can use.
Single-Primary Database Replication
The Single-Primary Replication scheme looks as follows:
The primary node, also known as the Master node, is the one accepting writes while the replica nodes can only process read-only transactions. Having a single source of truth allows us to avoid data conflicts.
To keep the replicas in-sync, the primary node must provide the list of changes that were done by all committed transactions.
Relational database systems have a Redo Log, which contains all data changes that were successfully committed.
PostgreSQL uses the WAL (Write-Ahead Log) records to ensure transaction Durability and for Streaming Replication.
Because the storage engine is separated from the MySQL server, MySQL uses a separate Binary Log for replication. The Redo Log is generated by the InnoDB storage engine, and its goal is to provide transaction Durability while the Binary Log is created by the MySQL Server, and it stores the logical logging records, as opposed to physical logging created by the Redo Log.
By applying the same changes recorded in the WAL or Binary Log entries, the replica node can stay in-sync with the primary node.
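The idea of replaying a change log can be sketched in a few lines of Python. This is a deliberately simplified model, not how PostgreSQL or MySQL implement it: the `Primary` and `Replica` classes are hypothetical, and the log holds plain key/value changes rather than real WAL or Binary Log records.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered list of committed changes

    def commit(self, key, value):
        # Record the change in the log first, then apply it to the data.
        self.log.append((key, value))
        self.data[key] = value


@dataclass
class Replica:
    data: dict = field(default_factory=dict)
    applied: int = 0  # position in the primary's log already replayed

    def catch_up(self, primary: Primary):
        # Replay, in order, every log entry not yet applied.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)


primary = Primary()
replica = Replica()
primary.commit("balance:42", 100)
primary.commit("balance:42", 80)
replica.catch_up(primary)
print(replica.data == primary.data)
```

Between calls to `catch_up` the replica lags behind the primary, which is the gap that synchronous and asynchronous replication modes handle differently.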
Horizontal scaling
The Single-Primary Replication provides horizontal scalability for read-only transactions. If the number of read-only transactions increases, we can create more replica nodes to accommodate the incoming traffic.
This is what horizontal scaling, or scaling out, is all about. Unlike vertical scaling, which requires buying more powerful hardware, horizontal scaling can be achieved using commodity hardware.
On the other hand, read-write transactions can only be scaled up (vertical scaling) as there is a single primary node.
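A minimal sketch of how an application might route traffic under this scheme follows. The `Router` class and the node names are assumptions made for the example; real setups usually rely on a driver, proxy, or framework feature for this.

```python
import itertools


class Router:
    """Send read-write transactions to the primary and spread reads over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, read_only: bool) -> str:
        return next(self._replicas) if read_only else self.primary


router = Router("primary", ["replica-1", "replica-2"])
targets = [router.route(read_only=ro) for ro in (True, True, False, True)]
print(targets)
```

Adding a replica only lengthens the list passed to `Router`, whereas the write path always ends at the single primary.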
I would recommend initially spending time reviewing the MySQL Docs on Replication. It's a good example of database replication. They are here:
http://dev.mysql.com/doc/refman/5.5/en/replication.html
Covering the entire scope of your question seems like too much for one question.
If you have some specific questions, please feel free to post them. Thanks!
Clustrix is a distributed database with a shared nothing architecture that supports both distributed transactions and replication. There is some technical documentation available that describes data distribution, distributed evaluation model, and built in fault tolerance, as well as an overview of the architecture.
As a MySQL replacement, Clustrix implements MySQL's replication policy and produces binlogs in the MySQL format, which are serialized so that Clustrix can act as either a Master or Slave to MySQL.

Transactional Replication For Write Heavy Medium Sized Database

We have a decent-sized, write-heavy database that is about 426 GB (including indexes) and about 300 million rows. We currently collect location data from devices that report to our server every couple of minutes, and we serve about 10,000 devices, so there are lots of writes every second. The location table that stores the location of each device has about 223 million rows. The data is currently archived by year.
Problems occur when users run large reports on this database: the whole database grinds almost to a halt.
I understand I need a reporting database, but my question is if anyone has experience of using SQL Server Transactional Replication on a database of equivalent size, and their experience of using this technology?
My rough plan is to point all the reports in our application to the Reporting Database, use Transactional Replication to replicate the data over from the master to the slave (Reporting Database).
Anyone have any thoughts on this strategy and the problems I may encounter?
Many thanks!
Transactional replication should work well in this scenario (the only effect the size of the database will have is the time taken to generate the initial snapshot). However, it may not solve your problem.
I think the issue you'll have if you choose transactional replication is that the slave server is going to be under the same load as the master machine as changes are applied - it will still crawl when users run large reports (assuming it's of a similar spec).
Depending on the acceptable latency of reporting data to the live data, this may or may not be OK for your users.
If some latency is acceptable you may get better performance from log shipping, since changes are applied in batches.
Before acquiring a reporting server, another approach would be to investigate the queries that your users are running and look at modifying either their code or the indexing strategy to better match what they're trying to do.
Transactional Replication could work well for you. The things to consider:
The target database tables must be read-only.
The server containing the target database should be stout enough to handle the SELECT traffic from the reporting applications.
Depending on the INSERT/UPDATE traffic, you may need to have a third server act as the Distribution server.
You also have to consider the size of the Distribution database.
Based on what I read here, I'd use a pull subscription from the Reporting server to offload traffic from the OLTP server.
You can skip the torment of a snapshot by initializing the reporting database from a backup of the OLTP database. See https://msdn.microsoft.com/en-us/library/ms151705.aspx
There will be INSERT/UPDATE/DELETE traffic from the Replication into both the Distribution and the Subscriber databases. That requires consideration, but lock/block issues should be no worse (and probably better) than running those reports off of OLTP.
I am running multiple publications on a 2.6TB database with 2.5GB/day of growth, using both pure transactional to drive reports (to two reporting servers) and Peer-to-Peer Transactional to replicate data in a scale-out for a SaaS offering (to three more servers). Because of this, we have a separate distributor.
Hope this helps.
Thanks
John.

SQL server 2005 replication to many slave servers - hardware replication or change the strategy

We have a 500 GB database that performs about 10,000 writes per minute.
This database has a requirement for real-time reporting. To service this need we have 10 reporting databases hanging off the main server.
The 10 reporting databases are all fed from the 1 master database using transactional replication.
The issue is that the server and replication are starting to fail with PAGEIOLATCH_SH errors, which seem to be caused by the master database being overworked. We are upgrading the server to a quad-processor, quad-core machine.
As this database and the need for reporting are only going to grow (20% growth per month), I want to know whether we should start looking at hardware (or a third-party application) to manage the replication, and if so what we should use, or whether we should change the replication topology: instead of the master replicating to each of the reporting databases, the master would replicate to reporting server 1, and reporting server 1 would replicate to reporting server 2.
Ideally the solution will cover us up to a 1.5 TB database with 100,000 writes per minute.
Any help greatly appreciated.
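For a sense of the timescale, here is a quick Python estimate. It assumes the 20% monthly growth compounds and applies equally to database size and write rate, which is my simplification rather than something we have measured.

```python
import math


def months_to_reach(current: float, target: float, monthly_growth: float) -> int:
    """Whole months of compound growth needed for current to reach target."""
    return math.ceil(math.log(target / current) / math.log(1 + monthly_growth))


size_months = months_to_reach(500, 1500, 0.20)          # GB: 500 GB to 1.5 TB
write_months = months_to_reach(10_000, 100_000, 0.20)   # writes per minute

print(size_months, write_months)
```

Under those assumptions the size target arrives well before the write-rate target, so storage and I/O on the master are the nearer concern.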
One common model is to have your main database replicate to 1 other node, then have that other node deal with replicating the data out from there. It takes the load off your main server and also has the benefit that if, heaven forbid, your reporting system's replication does max out it won't affect your live database at all.
I haven't gone much further than a handful of replicated hosts, but if you add enough nodes that your distribution node can't replicate it all it's probably sensible to expand the hierarchy so that your distributor is actually replicated to other distributors which then replicate to the nodes you report from.
How many databases you can have replicated off a single node will depend on how up-to-date your reporting data needs to be (e.g. whether it's fine to have it replicate only once a day or whether you need it current to the second) and how much data you're replicating at a time. It might be worth some experimentation to find out exactly how many nodes 1 distributor could power if it didn't have the overhead of actually running your main services.
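Once you have measured how many subscribers one distributor can feed, the depth of the hierarchy follows from simple arithmetic. The sketch below is in Python; the fan-out numbers are placeholders, not measurements, and `tiers_needed` is a name made up for this example.

```python
def tiers_needed(reporting_nodes: int, fan_out: int) -> int:
    """Levels of distributors needed when each node feeds at most fan_out others.

    A result of 1 means a single distributor feeds every reporting node directly.
    """
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2 for a hierarchy to help")
    tiers, capacity = 1, fan_out
    while capacity < reporting_nodes:
        capacity *= fan_out  # each extra tier multiplies the reachable nodes
        tiers += 1
    return tiers


# Hypothetical: 10 reporting nodes, with a distributor that copes with 10 or only 4.
print(tiers_needed(10, 10), tiers_needed(10, 4))
```

Each extra tier adds replication latency, so the measured fan-out and the acceptable staleness of reports need to be weighed together.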
Depending on what you're inserting, a load of 100,000 writes/min is pretty light for SQL Server. In my book, I show an example that generates 40,000 writes/sec (2.4M/min) on a machine with simple hardware. So one approach might be to see what you can do to improve the write performance of your primary DB, using techniques such as batch updates, multiple writes per transaction, table valued parameters, optimized disk configuration for your log drive, etc.
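To illustrate the "multiple writes per transaction" idea, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in. It is not SQL Server code, and the `location` table and its columns are invented for the example; the point is only that many rows share one commit instead of paying for one commit each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (device_id INTEGER, lat REAL, lon REAL)")

# Hypothetical batch of readings collected from devices.
rows = [(device_id, 51.5 + device_id * 1e-4, -0.12) for device_id in range(1000)]

# One transaction for the whole batch: commits on success, rolls back on error.
with conn:
    conn.executemany("INSERT INTO location VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM location").fetchone()[0]
print(count)
```

For scale, 100,000 writes per minute is roughly 1,667 per second, so even modest batching cuts the number of log flushes considerably.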
If you've already done as much as you can on that front, the next question I have is what kind of queries are you doing that require 10 reporting servers? Seems unusual, even for pretty large sites. There may be a bunch you can do to optimize on that front, too, such as offloading aggregation queries to Analysis Services, or improving disk throughput. While you can, scaling-up is usually a better way to go than scaling-out.
I tend to view replication as a "solution of last resort." Once you've done as much optimization as you can, I would look into horizontal or vertical partitioning for your reporting requirements. One reason is that partitioning tends to result in better cache utilization, and therefore higher total throughput.
If you finally get to the point where you can't escape replication, then the hierarchical approach suggested by fyjham is definitely a reasonable one.
In case it helps, I cover most of these issues in depth in my book:
Ultra-Fast ASP.NET.
Check that your publisher and distributor's transaction log files don't have too many VLFs (Virtual Log Files) as detailed here (step 8):
http://www.sqlskills.com/BLOGS/KIMBERLY/post/8-Steps-to-better-Transaction-Log-throughput.aspx
If your distribution database is co-located with your publisher database, consider moving it to its own dedicated server.
