Containers for database and scalability

Consider TiDB and the TiDB Operator as examples for this question.
TiDB
TiDB ("Ti" stands for Titanium) is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.
TiDB Operator
The TiDB Operator automatically deploys, operates, and manages a TiDB cluster in any Kubernetes-enabled cloud environment.
Once the database is live, there are broadly two load scenarios:
A very high rate of read-only queries.
A very high rate of write queries.
In either scenario, which component of the containerized database scales? Read replicas? The database 'engine' itself? Persistent volumes? All of the above?

Containerized infrastructure abstracts storage and compute resources
(consider PVs and Pods in Kubernetes), and these resources scale as the database scales. So the form of scaling depends on the database itself.
For TiDB, while it offers a MySQL-compatible SQL interface, its architecture is very different from MySQL and other traditional relational databases:
The SQL layer (TiDB) serves SQL queries and interacts with the storage layer based on the calculated query plan. It is stateless and scales on demand for both read and write queries. Typically, you scale out/up the SQL layer to get more compute resources for query plan calculation, joins, aggregation, and serving more connections.
The storage layer (TiKV) is responsible for storing data and serving KV APIs for the SQL layer. The most interesting part of TiKV is its Multi-Raft replication: the storage layer automatically splits data into pieces and distributes them evenly across containers. Each piece is a Raft group whose leader serves read and write queries. Upon scale in/out, the storage layer automatically migrates data pieces to balance the load. So scaling out the storage layer gives you better read/write throughput and larger data capacity.
Back to the question: all of the components mentioned in the question scale. The read/write replicas serving SQL queries can scale, the database "engine" (the storage layer) serving KV queries can scale, and the PVs are scaled out along with the storage layer.
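To make that concrete, here is a minimal sketch (not an official TiDB Operator recipe) that bumps the replica counts of a TidbCluster custom resource with the Kubernetes Python client; the namespace, cluster name, and replica values are placeholders.

```python
# Sketch: scale the SQL layer (tidb) and the storage layer (tikv) of a
# TidbCluster custom resource managed by the TiDB Operator.
# Assumes a reachable kubeconfig; names and counts below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a Pod
api = client.CustomObjectsApi()

patch = {
    "spec": {
        "tidb": {"replicas": 4},  # stateless SQL layer: more compute and connections
        "tikv": {"replicas": 5},  # storage layer: Raft groups rebalance, PVs follow
    }
}

api.patch_namespaced_custom_object(
    group="pingcap.com",
    version="v1alpha1",
    namespace="tidb-cluster",  # placeholder namespace
    plural="tidbclusters",
    name="basic",              # placeholder cluster name
    body=patch,
)
```

The operator then reconciles the cluster toward the new replica counts, creating Pods and PVs (and rebalancing TiKV regions) as needed.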

Related

Regarding the burden on Snowflake's database storage layer

Snowflake has an architecture consisting of the following three layers:
Database storage
Query processing
Cloud services
I understand that it is possible to create a warehouse for each process in the query processing layer and to scale up and scale out on a per-warehouse basis.
However, when the created warehouses run in parallel, I am worried about the load on the database storage layer.
Even though query processing can be load-balanced, since there is only one database storage layer, wouldn't a lot of parallel processing run against it and cause errors in the database storage layer?
Apologies if I have misunderstood the architecture.
The storage is immutable, so the query read load is just I/O against the cloud provider's storage layer, which is for all practical purposes infinitely scalable.
When any node updates a table, the new set of file partitions is recorded, and any warehouse that does not yet have those partitions performs remote I/O to read them.
The only downside of this pattern is that it does not scale well for transactional write patterns, which is why Snowflake is not targeted at those markets.
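A toy sketch of that copy-on-write idea (purely illustrative, not Snowflake's actual implementation): writers publish new immutable partition sets through metadata, and any number of warehouses resolve the current set before doing their own reads.

```python
# Toy model of immutable-partition storage: a table version is just a list
# of immutable file names; an update publishes a new version and never
# mutates existing files, so reads scale without blocking writes.
class TableMetadata:
    def __init__(self):
        self.versions = [[]]  # version 0: empty partition list

    def current_partitions(self):
        return self.versions[-1]

    def publish(self, new_partitions):
        # Copy-on-write: append a complete new partition set as the next version.
        self.versions.append(list(new_partitions))

meta = TableMetadata()
meta.publish(["part-0001", "part-0002"])  # initial load
meta.publish(["part-0001", "part-0003"])  # an update rewrites part-0002

# Each warehouse just reads the current partition list, then fetches the
# immutable files over the cloud provider's I/O layer.
for warehouse in ("wh_reports", "wh_adhoc"):
    print(warehouse, "reads", meta.current_partitions())
```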

Can we use Snowflake as the database for a data-driven web application?

I am an ASP.NET MVC/SQL Server developer, and I am very new to all of this, so I may be on a completely wrong path.
I learned by googling that Snowflake can put/get data from AWS S3, Google Storage, and Azure, and that Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? And if yes, could you provide a link or something to get started?
Once again, I am very new to all of this and hope to get ideas on the best way to work with it.
Thank you in advance.
Why should one use Snowflake when you can process your data with cloud storage (S3 etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform; like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself (see the sketch after this list).
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
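As a hedged illustration of that loading path, here is a minimal sketch using the Snowflake Python connector; the connection parameters, stage, and table names are placeholders.

```python
# Sketch: load files already staged in cloud storage into a Snowflake table.
# All connection parameters and object names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# COPY pulls the staged files into a native table; an external table would
# instead query the files in place, trading load time for query performance.
cur.execute("COPY INTO events FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV)")

cur.close()
conn.close()
```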
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data; for example, Snowflake doesn't enforce referential integrity.
However, if your web application is an analytical one (for example, it has embedded reports that query a large amount of data, and users will typically be reading data rather than adding it), then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, such as the ODBC driver: https://docs.snowflake.com/en/user-guide/odbc.html
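For instance, here is a minimal sketch using the Snowflake Python connector from a web backend (credentials and object names are placeholders):

```python
# Sketch: a read-heavy analytical query issued from a web application.
# Credentials and object names are placeholders.
import snowflake.connector

def daily_event_counts():
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="my_password",
        warehouse="APP_WH", database="ANALYTICS", schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT event_date, COUNT(*) FROM events "
            "GROUP BY event_date ORDER BY event_date"
        )
        return cur.fetchall()  # rows for an embedded report or dashboard
    finally:
        conn.close()
```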
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said. To add to that, in my experience, every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use one.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs, the more money they make. To make money they have to optimize performance, but not too much, or it will affect their income.
This can be an in-depth topic, so I would recommend doing more research into OLTP vs. OLAP.
Enforcing referential integrity is a double-edged sword: the downside is that, as the data volume grows, the referential-violation check significantly slows down inserts and deletes. This results in the developer putting the RI check in the application (with a dirty read) and turning off RI enforcement in the database, ending up in a Snowflake-like situation anyway.
The bottom line is that Snowflake not enforcing RI should not be a limitation for OLTP applications.
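A minimal sketch of the application-level check described above (table and column names are hypothetical, and `cur` is any open DB-API cursor); note that without database enforcement this check is best-effort and racy unless the application serializes these operations:

```python
# Sketch: application-side referential-integrity check before an insert,
# for a database (like Snowflake) that declares but does not enforce FKs.
# Table and column names are hypothetical.
def insert_order(cur, order_id, customer_id):
    cur.execute("SELECT 1 FROM customers WHERE id = %s", (customer_id,))
    if cur.fetchone() is None:
        raise ValueError(f"unknown customer_id {customer_id}")
    # Best-effort only: the parent row could be deleted between this check
    # and the insert unless the application serializes the two operations.
    cur.execute(
        "INSERT INTO orders (id, customer_id) VALUES (%s, %s)",
        (order_id, customer_id),
    )
```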

Precise difference between distributed database and decentralised database

Can anyone explain the precise difference between a distributed database and a decentralized database?
Decentralized
It means that there is no central storage. Several servers provide information to the clients, and the servers are connected to each other.
Distributed
There is no dedicated data storage; every node holds information. The clients are equal and have equal rights.
Main Difference
A distributed database is a single logical database installed on a set of computers that are geographically located at different sites and linked through a data communication network, whereas a decentralized database is installed on systems that are geographically located at different sites but not linked through a data communication network.
Coming to blockchain: it works like a distributed database that leverages cryptography to provide a multi-version concurrency control mechanism and to maintain consensus about the existence and status of shared facts in a trustless environment.
Source: "Database in blockchain" video
The hierarchy is centralized, then decentralized, then distributed.
Decentralized is simply centralized on a smaller scale. The risk of losing data in some catastrophic event is reduced because there are lots of little 'centers'.
Distributed databases/ledgers have no 'center'. The risk of a catastrophic event is further reduced (although other risks, such as forks, arise).

Is scaling out a NoSQL database in the cloud easier than scaling out an RDBMS?

The company I work at mainly uses SQL Server for persistence and is planning to move to AWS. I hear that scaling out an RDBMS in the cloud is really hard and costly because of the geographical regions you have to cover for failover scenarios. Does moving to a NoSQL database alleviate the problem of scaling out in the cloud?
It doesn't matter what technology you use; your challenges in scaling would be very similar.
I hear that scaling out an RDBMS in the cloud is really hard and costly because of the geographical regions
I don't agree. That's probably where cloud providers have the most benefits. For example, with one click you can enable a Multi-AZ RDS deployment. Yes, it's not cross-region, but it is fault-tolerant for sure. Alternatively, you can dump the DB to S3 and enable the recently launched cross-region replication of the data.
Another example is AWS Redshift, which enables you to resize a warehouse with no downtime.
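As a hedged sketch of that 'one click' in code, here is how enabling Multi-AZ on an existing RDS instance looks with boto3 (the instance identifier is a placeholder):

```python
# Sketch: enable Multi-AZ fault tolerance for an existing RDS instance.
# The instance identifier is a placeholder. This provisions a synchronous
# standby in another availability zone within the same region.
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="my-sqlserver-instance",  # placeholder
    MultiAZ=True,
    ApplyImmediately=True,
)
```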

How does Replication work in a Distributed Database

I would like to know how replication works in a distributed database. It would be nice if this could be explained in a thorough yet easy-to-understand way.
It would also be nice if you could compare distributed transactions with distributed replication.
Single point of failure
The database server is a central part of an enterprise system, and, if it goes down, service availability might get compromised.
If the database server is running on a single server, then we have a single point of failure. Any hardware issue (e.g., disk drive failure) or software malfunction (e.g., driver problems, malfunctioning updates) will render the system unavailable.
Limited resources
If there is a single database server node, then vertical scaling is the only option when it comes to accommodating a higher traffic load. Vertical scaling, or scaling up, means buying more powerful hardware, which provides more resources (e.g., CPU, Memory, I/O) to serve the incoming client transactions.
Up to a certain hardware configuration, vertical scaling can be a viable and simple solution to scale a database system. The problem is that the price-performance ratio is not linear, so after a certain threshold, you get diminishing returns from vertical scaling.
Another problem with vertical scaling is that, in order to upgrade the server, the database service needs to be stopped. So, during the hardware upgrade, the application will not be available, which can impact underlying business operations.
Database Replication
To overcome the aforementioned issues associated with having a single database server node, we can set up multiple database server nodes. The more nodes, the more resources we will have to process incoming traffic.
Also, if a database server node is down, the system can still process requests as long as there are spare database nodes to connect to. For this reason, upgrading the hardware or software of a given database server node can be done without affecting the overall system availability.
The challenge of having multiple nodes is data consistency. If all nodes are in sync at any given time, the system is Linearizable, which is the strongest guarantee when it comes to data consistency across multiple registers.
The process of synchronizing data across all database nodes is called replication, and there are multiple strategies that we can use.
Single-Primary Database Replication
The Single-Primary Replication scheme works as follows:
The primary node, also known as the Master node, is the only one accepting writes, while the replica nodes can only process read-only transactions. Having a single source of truth allows us to avoid data conflicts.
To keep the replicas in sync, the primary node must provide the list of changes made by all committed transactions.
Relational database systems have a Redo Log, which contains all data changes that were successfully committed.
PostgreSQL uses the WAL (Write-Ahead Log) records to ensure transaction Durability and for Streaming Replication.
Because the storage engine is separated from the MySQL server, MySQL uses a separate Binary Log for replication. The Redo Log is generated by the InnoDB storage engine, and its goal is to provide transaction Durability, while the Binary Log is created by the MySQL Server and stores logical logging records, as opposed to the physical logging created by the Redo Log.
By applying the same changes recorded in the WAL or Binary Log entries, the replica node can stay in sync with the primary node.
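A toy sketch of that log-shipping idea (illustrative only, not how PostgreSQL or MySQL implement it): the primary appends committed changes to a log, and a replica stays in sync by replaying the entries past its last applied position.

```python
# Toy single-primary replication: a replica replays the primary's change log.
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # committed changes, in commit order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # index of the next log entry to apply

    def sync(self, primary):
        # Replay only the entries committed since the last sync.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

primary, replica = Primary(), Replica()
primary.write("account:1", 100)
primary.write("account:2", 250)
replica.sync(primary)
assert replica.data == primary.data  # the replica has caught up
```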
Horizontal scaling
The Single-Primary Replication provides horizontal scalability for read-only transactions. If the number of read-only transactions increases, we can create more replica nodes to accommodate the incoming traffic.
This is what horizontal scaling, or scaling out, is all about. Unlike vertical scaling, which requires buying more powerful hardware, horizontal scaling can be achieved using commodity hardware.
On the other hand, read-write transactions can only be scaled up (vertical scaling) as there is a single primary node.
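A minimal sketch of the routing pattern this implies (the connection objects are hypothetical stand-ins): writes always go to the single primary, while reads are spread across however many replicas exist.

```python
# Sketch: read/write splitting across one primary and N read replicas.
# The "connections" here are plain strings standing in for real connections.
import itertools

class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin the reads

    def connection_for(self, sql):
        # Read-only statements can go to any replica; everything else must
        # hit the primary, which is why writes only scale vertically.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary", ["replica-1", "replica-2", "replica-3"])
print(router.connection_for("SELECT * FROM users"))        # some replica
print(router.connection_for("UPDATE users SET active=1"))  # the primary
```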
I would recommend initially spending time reviewing the MySQL Docs on Replication. It's a good example of database replication. They are here:
http://dev.mysql.com/doc/refman/5.5/en/replication.html
Covering the entire scope of your question seems like too much for one question.
If you have some specific questions, please feel free to post them. Thanks!
Clustrix is a distributed database with a shared-nothing architecture that supports both distributed transactions and replication. There is some technical documentation available that describes data distribution, the distributed evaluation model, and built-in fault tolerance, as well as an overview of the architecture.
As a MySQL replacement, Clustrix implements MySQL's replication policy and produces binlogs in the MySQL format, serialized so that Clustrix can act as either a Master or a Slave to MySQL.

Resources