How is data replicated between two different TiKV clusters? - tikv

Suppose I have one TiKV cluster deployed in city A and another TiKV cluster deployed in city B. And I want to write data in cluster A and read it in cluster B.
I know that inside cluster A, data safety is ensured by multi-group Raft. But how can the data in cluster A be replicated to cluster B and always kept up to date? How is inter-cluster replication performed?

I suggest you deploy one cluster in city A and then add a learner replica in city B. You can then read the data in city B directly via follower read.

You can create a single TiKV cluster spanning city A and city B, and then use the follower-read feature for this scenario.
https://docs.pingcap.com/tidb/stable/follower-read#follower-read
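For example, with a TiDB layer on top of that single cluster, a session running in city B can opt in to follower reads so its queries are served by a nearby replica instead of the Raft leader in city A. A minimal sketch (the orders table is hypothetical, the accepted variable values depend on your TiDB version, and whether the city-B replica is chosen also depends on how the stores are labelled):

-- Route reads in this session to a follower replica instead of the Raft leader.
SET SESSION tidb_replica_read = 'follower';

-- This query can now be served by the replica in city B rather than crossing to city A.
SELECT * FROM orders WHERE id = 42;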

Related

SQL Server Always On: How to simulate disks in the secondary replica which only exist in the primary replica

I have an Always On system with two replicas: A and B. A is intended to be mainly the primary replica.
Replica A has 2 local disks dedicated to user databases data and user databases logs: D: and L:.
Replica B has an only disk: E:, and I have created two folders in it for data and logs: E:\Data and E:\Log
Nevertheless, as far as I know, the database file paths must be the same in both replicas. In replica A I have D: for user database data and L: for user database logs, but I have no such disks in replica B.
How can I simulate the necessary disks in B?
Thanks in advance
It's not a hard requirement to have the same file paths on both servers, but it makes your life easier.
In Windows you can create multiple volumes on a single disk. So on B, delete the E: volume and add two new volumes on the disk, assigning the drive letters D and L, just like the primary.
Or you could use D:\Data and D:\Logs on both servers by mounting the L: volume as D:\Logs on the primary, and re-mapping E: to D: on the secondary.
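If you do keep different paths (E:\Data and E:\Log) on B, you can still seed the secondary manually and relocate the files during the restore. A minimal T-SQL sketch, assuming manual seeding via backup/restore; the database name, logical file names, backup path and target paths are hypothetical:

-- Run on replica B before joining the database to the availability group.
RESTORE DATABASE MyUserDb
FROM DISK = N'\\backupshare\MyUserDb.bak'
WITH MOVE N'MyUserDb_Data' TO N'E:\Data\MyUserDb.mdf',
     MOVE N'MyUserDb_Log'  TO N'E:\Log\MyUserDb.ldf',
     NORECOVERY;  -- leave it restoring so it can join the availability group

The caveat above still applies: later file operations on the primary (for example adding a file on D:) will fail to replay on B if that path does not exist there, which is why identical paths make life easier.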

Clone data across snowflake accounts on schema basis

I have two Snowflake accounts. I want to clone some schemas from a database in one account to schemas of a database in another account.
Account1: https://account1.us-east-1.snowflakecomputing.com
Database: DB_ONE
SCHEMAS: A1SCHEMA1, A1SCHEMA2, A1SCHEMA3, A1SCHEMA4 (has external tables)
Account2: https://account2.us-east-1.snowflakecomputing.com
Database: DB_TWO
SCHEMAS: A2SCHEMA1, A2SCHEMA3, A2SCHEMA4 (has external tables)
Both accounts are under same organization.
I want to clone A1SCHEMA1 of DB_ONE from account1 to A2SCHEMA1 of DB_TWO in account2.
Is it possible? If so, what are the instructions? I have found info at the database level but not at the schema level. Also, I would need to refresh the data from the clone on an on-demand basis.
Can I clone A1SCHEMA4 of DB_ONE from account1 to A2SCHEMA4 of DB_TWO in account2, given that it has external tables?
Note: DB_ONE is not created from a share. Basically, I want to get data from prod to a lower environment, via replication or cloning, but I want to be able to refresh it as well.
Since your goal appears to be to leverage prod data for development purposes, data sharing isn't a good solution, since it is read-only. Data replication is probably the best solution here, but since you want it at a schema level, you're going to need to change things up a bit (a sketch of the commands follows below):
1. Create a schema-level clone on Prod into a separate database on Prod.
2. On dev, create a database that is a replica of the Prod clone database. This will be a read-only replica of prod, so you'll need the next step.
3. Once the prod clone is replicated to dev, you can then clone that database/schema into your persistent development structures at a schema level.
This sounds like a lot of hops of data, but keep in mind that clones are zero-copy, so the only true data movement is across the replication process. This will cost you some replication processing, but since the two accounts are in the same region, you will not be charged for data egress and the process will run pretty fast.
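A minimal Snowflake SQL sketch of that flow; DB_ONE_SHARE, MYORG, ACCOUNT1 and ACCOUNT2 are hypothetical names, so substitute your own organization and account identifiers:

-- 1. On account1 (prod): clone the schema into a dedicated database used only for replication.
CREATE DATABASE IF NOT EXISTS DB_ONE_SHARE;
CREATE OR REPLACE SCHEMA DB_ONE_SHARE.A1SCHEMA1 CLONE DB_ONE.A1SCHEMA1;

-- 2. On account1: allow that database to be replicated to the dev account.
ALTER DATABASE DB_ONE_SHARE ENABLE REPLICATION TO ACCOUNTS MYORG.ACCOUNT2;

-- 3. On account2 (dev): create the read-only replica and refresh it whenever you need new data.
CREATE DATABASE DB_ONE_SHARE_REPLICA AS REPLICA OF MYORG.ACCOUNT1.DB_ONE_SHARE;
ALTER DATABASE DB_ONE_SHARE_REPLICA REFRESH;

-- 4. On account2: clone the replicated schema into the persistent dev database.
CREATE OR REPLACE SCHEMA DB_TWO.A2SCHEMA1 CLONE DB_ONE_SHARE_REPLICA.A1SCHEMA1;

To refresh on demand, re-run the clone in step 1 on prod, then the REFRESH in step 3 and the clone in step 4 on dev.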

Confused about AWS RDS read replica databases. Why can I edit rows?

Edit: I'm not trying to edit the read replica. I'm saying I did edit it and I'm confused on why I was able to.
I have a database in US-West. I made a read replica in Mumbai, so the users in India don't experience slowness. Out of curiosity, I tried to edit a row in the Mumbai read-replica database hoping to get a security error rejecting my write attempt (since after all, it is a READ replica). But the write operation was successful. Why is that? Shouldn't this be a read-only database?
I then went to the master database hoping the write would at least have been synchronized, but my edit hadn't persisted there; the master database was now different from the replica.
I also tried editing data in the master database, hoping it would replicate to the slave database, but that failed as well.
Obviously, I'm not understanding something.
Take a look at this link from Amazon Web Service to get an idea:
How do I configure my Amazon RDS DB instance read replica to be modifiable?
Probably your read replica has the flag read_only = false
The linked steps control this via the DB parameter group attached to the replica:
In the navigation pane, choose Parameter Groups. The available DB parameter groups appear in a list.
In the list, select the parameter group you want to modify.
Choose Edit Parameters and set the following parameter to the specified value: read_only = 0 (or read_only = 1 to make the replica read-only again).
Choose Save Changes.
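You can check what the replica is actually running with by connecting to the replica endpoint (a minimal sketch for the MySQL/MariaDB engines):

SELECT @@global.read_only;        -- 0 means the replica currently accepts writes
SHOW VARIABLES LIKE 'read_only';  -- same information as a variable listing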
I think you should read a little about Cross region read replicas and how they work.
Working with Read Replicas of MariaDB, MySQL, and PostgreSQL DB Instances
Read Replica lag is influenced by a number of factors, including the load on both the primary and secondary instances, the amount of data being replicated, the number of replicas, whether they are within the same region or cross-region, etc. Lag can stretch to seconds or minutes, though typically it is under one minute.
Reference: https://stackoverflow.com/a/44442233/1715121
Facts to remember about RDS Read Replica
A read replica is initially created from a snapshot of the primary database.
Read replicas are available in Amazon RDS for MySQL, MariaDB, and PostgreSQL.
Read replicas in Amazon RDS for MySQL, MariaDB, and PostgreSQL provide a complementary availability mechanism to Amazon RDS Multi-AZ Deployments.
All traffic between the source and destination database is encrypted for read replicas.
You need to enable backups before creating read replicas. This can be done by setting the backup retention period to a value other than 0.
Amazon RDS for MySQL, MariaDB and PostgreSQL currently allow you to create up to five read replicas for a given source DB instance.
It is possible to create a read replica of another read replica. You can create a second-tier Read Replica from an existing first-tier Read Replica. By creating a second-tier Read Replica, you may be able to move some of the replication load from the master database instance to a first-tier Read Replica.
Even though a read replica is updated from the source database, the target replica can still become out of sync due to various reasons.
You can delete a read replica at any point in time.
I had the same issue. (Old question, but I couldn't find an answer anywhere else and this was my exact issue)
I had created a cross-region read replica; when it completed, all the original data was there, but no updates were synchronised between the two regions.
The issue was the parameter groups.
In my case, I had changed my primary from the default parameter group to one which allowed case insensitive tables. The parameter group is not copied over to the new region, so the replication was failing with:
Error 'Table 'MY_TABLE' doesn't exist' on query. Default database: 'mydb'. Query: 'UPDATE MY_TABLE SET ....''
So in short, create a parameter group in your new region that matches the primary region and assign that parameter group as soon as the replica is created.
I also ran this script on the newly created replica:
CALL mysql.rds_start_replication
I am unsure if this was required, but I ran it anyway.
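To see whether replication is running again after fixing the parameter group, you can check the replica's status directly (a sketch for the MySQL engine; newer MySQL versions use SHOW REPLICA STATUS instead):

SHOW SLAVE STATUS\G
-- Check Slave_IO_Running / Slave_SQL_Running and Seconds_Behind_Master,
-- plus Last_SQL_Error if replication stopped on an error like the one above.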
I think only adding indexes can be done on the slave DB in Amazon RDS if you put the read replica in write mode, and it will continue in write mode until you change the parameter to read_only = 1 and apply it immediately.

How to administer storage of a ClickHouse server in a cluster when disks get full

I'm setting up a ClickHouse server in a cluster, but one of the things that doesn't appear in the documentation is how to manage very large amounts of data. It says that it can handle up to petabytes of data, but you can't store that much data in a single server; you will usually have a few terabytes in each.
So my question is: how can I start by storing the data on one node of the cluster and then, when it requires more space, add another node? Will it handle the distribution to the new server automatically, or will I have to play with the weights in the shard distribution?
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested, and if it is, taking longer to retrieve the data wouldn't be a problem.
What solution would you find for this? Handling an ever-expanding database to avoid disk space issues in the future.
Thanks
I will assume that you use standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
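A minimal sketch of that layout in current ClickHouse DDL (the cluster name my_cluster, the default database, and the table and columns are hypothetical; the {shard} and {replica} macros come from each node's configuration):

-- Per-shard replicated table, partitioned by month as in the older MergeTree default.
CREATE TABLE events_local
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Query entry point: fans SELECTs out to all shards of my_cluster and
-- distributes INSERTs by the sharding expression (rand() here).
CREATE TABLE events_all AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());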
When you add a new shard, old data is not moved to it automatically. The recommended approach is indeed to "play with the weights" as you have put it, i.e. increase the weight of the new node until the volume of data is even. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind the various limitations of this command, e.g. it is not atomic.
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will hardly ever be requested, and if it is, taking longer to retrieve the data wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. You can e.g. at the start of each month detach the partition for some older month and back it up to Amazon S3. Or you can setup a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically, else you can setup two Distributed tables: table_recent and table_all (with the cluster config including the nodes with old partitions).
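A sketch of that monthly rotation (the table and partition value are hypothetical; with month-based partitioning the partition value is the YYYYMM number):

-- Detach January 2018; its parts move to the table's detached/ directory on disk,
-- from where they can be copied to S3 or to a cheaper archive node.
ALTER TABLE events_local DETACH PARTITION 201801;

-- To bring the data back later, copy the parts into detached/ and re-attach:
ALTER TABLE events_local ATTACH PARTITION 201801;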
Version 19.15 introduced multi-disk storage configuration. Version 20.1 introduces time-based data rearrangement.
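With those features the tiering can be declared on the table itself. A sketch, assuming a storage policy named hot_to_cold defined in the server configuration with a fast 'hot' volume and a large, cheap 'cold' volume:

CREATE TABLE events_tiered
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 2 YEAR TO VOLUME 'cold'  -- move parts older than 2 years to the cold volume
SETTINGS storage_policy = 'hot_to_cold';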

Distributed FS with deterministic multiple masters?

I'm looking for a distributed file (or other storage) system for managing a very large number of mutable documents. Each document can be rather large (1-100MB). Some reads need to be guaranteed to be working from the latest data, and some can be read from eventually-consistent replicated data. Each document could be a self-contained file (say, a SQLite database or other custom file format).
For optimal performance, the node of the distributed file system on which writes happen for each document must be different. In other words, server A is the master for document 1 and server B is replicating it, but server B is the master for document 2 and server A is replicating it. For my application, a single server is not going to be able to handle all of the write traffic for the whole system, so having a single master for all data is not acceptable.
Each document should be replicated across some number of servers (say, 3). So if I have 1000 documents and 10 servers, each server would have a copy of 300 documents, and be the master for 100 of those. Ideally, the cluster would automatically promote servers to be masters for documents whose master server had crashed, and re-balance the storage load as new servers are added to the cluster.
I realize this is a pretty tall order... is there something available that meets most of my core needs?
I think HDFS would fit the criteria you listed above.
