Cassandra DB - No active nodes? - database

I'm new to Cassandra and I'm trying to make a basic Cassandra server but I am having difficulties. Through some sheer miracle, I've managed to create a keyspace and some tables. However, whenever I try interacting with the tables, I get the following error:
"Unable to execute CQL script on 'Localhost': not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
The message led me to believe I have no active nodes, but I have cassandra.bat running in the background (I'm on Win10), and that has allowed me to connect and create keyspaces and tables.
Moreover, when I try doing anything with nodetool, it processes indefinitely (or takes a very long time; I'm too impatient to find out, but I guessed the former due to my previous assumption).
My keyspace uses NetworkTopologyStrategy with 1 datacenter, a replication factor of 3, and durable writes enabled.
Does anybody have any idea what's wrong?

First, you've specified a replication factor of 3, although you have only one node. Second, you need to check which datacenter name you specified in the NetworkTopologyStrategy - you can find it by running nodetool status. After that, change the existing keyspace using the command:
ALTER KEYSPACE keyspace_name
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter_name' : 1};
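If nodetool isn't responding, the datacenter name can also be read from cqlsh; this is just a minimal check against the system tables, nothing specific to your setup:
SELECT data_center FROM system.local;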
P.S. I recommend watching the DS201/DS210/DS220 courses on DataStax Academy - they will give you a good overview of Cassandra, basic operations, and data modelling.

Related

Fixing Cassandra Database

My co-worker and I have been thrown into a project that uses Cassandra with no introductions.
Alright, let's do this!
SELECT * FROM reports WHERE timestamp < '2019-01-01 00:00:00' ALLOW FILTERING;
Error: 1300
Apparently, we have too many tombstones. What's that?
A tombstone is a marker for deleted data that hasn't been physically removed yet, for performance reasons.
Tombstones are only dropped during compaction once gc_grace_seconds has expired (the default is 10 days), and nodetool repair should run within that window so deleted data doesn't reappear.
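For reference, gc_grace_seconds is a per-table option, so it can be adjusted with plain CQL; a sketch against the reports table (864000 seconds = 10 days, the default):
ALTER TABLE reports WITH gc_grace_seconds = 864000;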
Now, this project is around 7 years old and it doesn't seem like there's a job that runs repair.
According to the default warning and error thresholds, 1K tombstones is already a lot. We found about 1.4M.
We measured the number of tombstones by turning tracing on, running a SELECT query, and adding up the tombstone counts reported in the trace.
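In cqlsh, that measurement looks roughly like this (the tombstone counts appear in the trace output):
TRACING ON;
SELECT * FROM reports WHERE timestamp < '2019-01-01 00:00:00' ALLOW FILTERING;
TRACING OFF;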
We tried to run nodetool repair --full -pr -j 4 but we get Validation failed in /10.0.3.1.
DataStax's guide to repairing repairs wants us to fix the validation error with nodetool scrub.
But we still get the same error afterwards.
The guide then wants us to run sstablescrub, which failed with an out-of-memory exception.
Going back to our original problem of deleting data before 2019, we tried to run DELETE FROM reports WHERE timestamp < '2019-01-01 00:00:00'.
However, timestamp is not our partition key, so we are not allowed to delete data like this, which has also been confirmed by many other StackOverflow posts and a DataStax issue on Jira.
Every post mentions that we should "just" change the schema of our Cassandra database to fit our queries.
First, we only need to do this once; second, our client wants to have this data deleted as soon as possible.
Is there a way of easily changing the schema of a Cassandra database?
Is there a way that we can make a slow solution that at least works?
All in all, we are new to Cassandra and we are unsure on how to proceed.
What we want is
delete all data from before 2019 and confirm that it is deleted
have stable selects, avoiding error 1300
Can you help?
We have 4 nodes running in Docker on Azure, if that is relevant.
The version of Cassandra is 3.11.6.
Tombstones can exist in the SSTables for longer than 10 days because they are only evicted during compaction, and if compaction hasn't happened for a long time, they just stay there. You have the following options available (for 3.11.x):
if you have the disk space, you may force compaction using nodetool compact -s, which will combine all SSTables into a few SSTables - this will put a lot of load onto the system, as it reads all data & writes it back
use nodetool garbagecollect to evict old data & expired tombstones - but it may not delete all tombstones
you can tune the compaction parameters of the specific table so compaction happens more often, e.g. decrease the minimum number of SSTables required for compaction from 4 to 2, plus some other options (min_threshold, tombstone_threshold, etc.) - see the sketch after this list
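A minimal sketch of that last option, assuming the reports table uses the default SizeTieredCompactionStrategy; the threshold values are illustrative, and unchecked_tombstone_compaction is one of the "other options":
ALTER TABLE reports WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': '2',
    'tombstone_threshold': '0.1',
    'unchecked_tombstone_compaction': 'true'
};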
In the future, for repairs it's recommended to use something like Reaper, which performs token-range repairs, putting less load onto the system.
Mass deletion of data could be done by external tools, for example:
Spark + Spark Cassandra Connector - see this answer for example
DSBulk - you can use the -query option to unload only the primary key columns of the matching rows to disk (using the :start/:end keywords for token ranges), and then load that data back with -query 'DELETE FROM table WHERE primary_key = ...'
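The two -query statements would look roughly like this; report_id stands in for whatever your real primary key columns are, ks.reports is a placeholder name, and :start/:end are DSBulk's token-range placeholders:
-- passed to dsbulk unload -query "...": write the key of every pre-2019 row to disk
SELECT report_id FROM ks.reports
WHERE token(report_id) > :start AND token(report_id) <= :end
  AND timestamp < '2019-01-01 00:00:00' ALLOW FILTERING;
-- passed to dsbulk load -query "...": replay the unloaded keys as deletes
DELETE FROM ks.reports WHERE report_id = :report_id;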
And for the schema change - it's not the most trivial task. To match your table structure to your queries you will most probably need to change the primary key, and in Cassandra this is done only by creating new table(s) and loading data into them. For that task you'll also need something like Spark or DSBulk, especially if you need to migrate data with TTLs and/or write times. See this answer for more details.
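For completeness, "matching the table structure to the query" here usually means something like a time-bucketed table, so old data can be dropped one partition at a time; the column names below are hypothetical, and your real reports table will differ:
CREATE TABLE reports_by_day (
    day       date,        -- partition key: one bounded partition per day
    ts        timestamp,   -- clustering column: orders reports within the day
    report_id uuid,
    payload   text,
    PRIMARY KEY ((day), ts, report_id)
) WITH CLUSTERING ORDER BY (ts DESC, report_id ASC);
-- pre-2019 data can then be removed a whole partition at a time:
DELETE FROM reports_by_day WHERE day = '2018-12-31';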

Confused about AWS RDS read replica databases. Why can I edit rows?

Edit: I'm not trying to edit the read replica. I'm saying I did edit it and I'm confused on why I was able to.
I have a database in US-West. I made a read replica in Mumbai, so the users in India don't experience slowness. Out of curiosity, I tried to edit a row in the Mumbai read-replica database hoping to get a security error rejecting my write attempt (since after all, it is a READ replica). But the write operation was successful. Why is that? Shouldn't this be a read-only database?
I then went to the master database, hoping the write would at least be synchronized, but my edit hadn't propagated there. The master database was now different from the replica.
I also tried editing data in the master database, hoping it would replicate to the slave database, but that failed as well.
Obviously, I'm not understanding something.
Take a look at this link from Amazon Web Service to get an idea:
How do I configure my Amazon RDS DB instance read replica to be modifiable?
Probably your read replica has the flag read_only = false
To change it, modify the newly created parameter group and set the following parameter:
In the navigation pane, choose Parameter Groups. The available DB parameter groups appear in a list.
In the list, select the parameter group you want to modify.
Choose Edit Parameters and set the following parameter to the specified value:
read_only = 0
Choose Save Changes.
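To confirm what the replica is currently using, the setting can also be checked from any SQL client (assuming a MySQL/MariaDB replica; 0 means writable, 1 means read-only):
SHOW VARIABLES LIKE 'read_only';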
I think you should read a little about Cross region read replicas and how they work.
Working with Read Replicas of MariaDB, MySQL, and PostgreSQL DB Instances
Read Replica lag is influenced by a number of factors including the load on both the primary and secondary instances, the amount of data being replicated, the number of replicas, if they are within the same region or cross-region, etc. Lag can stretch to seconds or minutes, though typically it is under one minute.
Reference: https://stackoverflow.com/a/44442233/1715121
Facts to remember about RDS Read Replica
A Read Replica is created from a snapshot of the primary database.
Read replicas are available in Amazon RDS for MySQL, MariaDB, and PostgreSQL.
Read replicas in Amazon RDS for MySQL, MariaDB, and PostgreSQL provide a complementary availability mechanism to Amazon RDS Multi-AZ Deployments
All traffic between the source and destination database is encrypted for Read Replicas.
You need to enable backups before creating Read Replicas. This can be done by setting the backup retention period to a value other than 0.
Amazon RDS for MySQL, MariaDB and PostgreSQL currently allow you to create up to five Read Replicas for a given source DB Instance
It is possible to create a read replica of another read replica. You can create a second-tier Read Replica from an existing first-tier Read Replica. By creating a second-tier Read Replica, you may be able to move some of the replication load from the master database instance to a first-tier Read Replica.
Even though a read replica is updated from the source database, the target replica can still become out of sync due to various reasons.
You can delete a read replica at any point in time.
I had the same issue. (Old question, but I couldn't find an answer anywhere else and this was my exact issue)
I had created a cross-region read replica; when it was complete, all the original data was there, but no updates were synchronised between the two regions.
The issue was the parameter groups.
In my case, I had changed my primary from the default parameter group to one which allowed case insensitive tables. The parameter group is not copied over to the new region, so the replication was failing with:
Error 'Table 'MY_TABLE' doesn't exist' on query. Default database: 'mydb'. Query: 'UPDATE MY_TABLE SET ....''
So in short, create a parameter group in your new region that matches the primary region and assign that parameter group as soon as the replica is created.
I also ran this script on the newly created replica:
CALL mysql.rds_start_replication
I am unsure if this was required, but I ran it anyway.
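If replication still looks stuck, the replica's status and the exact replication error can be checked with plain MySQL; Slave_IO_Running, Slave_SQL_Running, and Last_Error are the fields to look at:
SHOW SLAVE STATUS;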
I think adding indexes is about the only thing that should be done on the slave DB in Amazon RDS if you put the read replica into write mode, and it will stay in write mode until you change the parameter back to read_only = 1 and apply it immediately.

Synchronize data b/w two data stores

I have two different databases. One is an old legacy database which I'll be decommissioning because the old service is no longer used. The other belongs to a new service that will eventually replace the old system. Before that happens, we need both services running for a while.
Both have two tables for users: one storing the email address and password, and another for simple user-related data (addresses).
I need to synchronize data between these two databases. The old one is an MS SQL Server DB and the new one is a NoSQL DB (DynamoDB).
My strategy would be: before going live, copy all the users from the old DB to the new one, and then, once the new system is running, synchronize the users between the two DBs.
I'll do this by having a tool run periodically that checks for any users added after the last run, by querying the users table with something like WHERE CreationDate >= LastRunTime, and then checking for each user whether it exists in the other database. I'll do this two-way, i.e. from old DB -> new DB and from new DB -> old DB.
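On the SQL Server side, that periodic check could look roughly like this; the Users table and the CreationDate/LastModifiedDate columns come from the question, while UserId, Email, and @LastRunTime are hypothetical names used for illustration:
DECLARE @LastRunTime datetime2 = '2020-01-01T00:00:00';  -- persisted by the sync tool between runs
SELECT UserId, Email, CreationDate, LastModifiedDate
FROM Users
WHERE CreationDate >= @LastRunTime
   OR LastModifiedDate >= @LastRunTime;  -- the same pass also catches updated rows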
Is this a good way of doing this? Any other better, fast solutions to achieve this?
How can I detect changes to existing users' data? Is there any better solution than checking & matching every user's record in both systems' tables, taking the one that was modified last (by checking the LastModifiedDate timestamp on each record), and updating it in the other system's table?
Solution 1 (my recommendation): whenever the system inserts/updates a record in either of the databases, you add/update the record in that database and also put that information onto a queue.
A separate reader will read from the queue and periodically replicate the data to the respective database; this way your data stays in sync between the databases.
Note: Another advantage of using the queue would be that you don't have to set very high throughput in your DynamoDB table.
Solution 2: What you had suggested in your question, you can add a CRON job that will replicate the databases by checking the record based on timestamp.
I've executed several table migrations from Oracle / MySQL to DynamoDB with no downtime, and the approach I used was a little different from what you described. This approach ends up requiring more coding, but I would consider it lower risk than the hard cutover you described.
This approach requires multiple phases as described below:
Phase 1
Create the new DynamoDB table(s) for the data in your legacy system.
Phase 2
Update your application to write/update data in both the legacy database and in DynamoDB. Your application will still read and write to the legacy system so this should be a low risk change.
Immediately before deploying this code, load DynamoDB up with all of the old data.
Immediately after deploying, audit the databases to make sure they are in sync.
Phase 3
Update your application to start reading from DynamoDB. This should be low risk because your application will have been maintaining data in DynamoDB for some time.
Keep your application writing to the legacy database so you can cut back if you identify any problems in the new implementation. This ensures the cutover is low risk and you can easily roll back.
Phase 4
Remove the code from your application that reads and writes to the legacy database and deploy this to production.
You can now decommission the legacy database!
This is definitely more steps and will take more time than just taking the application down, migrating all of the data, and then deploying a new version of the application to read/write from DynamoDB. However, the main benefit to this approach is that it not only requires no downtime but is lower risk as it tests the change in phases and allows for easy rollback if any issues are encountered.
At a high level, a sync job could be 1) cron-job based or 2) notification based.
The cron job could do syncing as well as auditing if you have "creation time" and "last updated" timestamps on each record. In this case the master DB (the one the data should be synced from) is normally the SQL DB, since it's much easier to do a table scan in SQL than in NoSQL (in DynamoDB you need to use its Scan operation, and it's limited by the table's hash key).
The second option is to build a notification mechanism, which could be based on DynamoDB Streams: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html. It's a mature DynamoDB feature, it guarantees event order, and it can achieve near-real-time event delivery. What you need to do is build a listener for those events.
Lastly, you could take a look at AWS Database Migration Service https://aws.amazon.com/dms/ to see if it satisfies your requirement.

Warehouse PostgreSQL database architecture recommendation

Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so that data with one group key is totally independent from data with other group keys. Reports are built using only 1 group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively by parallel processes that add data to different groups, but I don't need a realtime report. One update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I found that some reports take significant time to generate (up to 15 seconds), because they need to query not a single table but 3-4 of them.
What I want to do is reduce the time it takes to create a report without significantly changing the technologies or schema of the solution.
Possible solutions
What I was thinking about this is:
Splitting one database into several databases, 1 database per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data, for example once per 15 minutes, using PostgreSQL's replication mechanism. (Can I actually do that, or do I have to write custom code?)
I don't want to change the database to NoSQL because I would have to rewrite all the SQL queries, and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that are precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions (a sketch follows after point 2). This can be an automated process - a DB function (or script) gets run via your job scheduler of choice - that refreshes the data every N minutes.
2). Regarding replication: if you are using Streaming Replication (PostgreSQL 9+), the changes in the master DB are replicated to the slave databases (hot standby = read only), which can then be used for reporting.
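One concrete way to do 1) in PostgreSQL 9.3+ is a materialized view refreshed on a schedule; the table and column names here are hypothetical:
CREATE MATERIALIZED VIEW report_summary AS
SELECT group_key,
       date_trunc('day', created_at) AS day,
       count(*)    AS row_count,
       sum(amount) AS total_amount
FROM facts
GROUP BY group_key, date_trunc('day', created_at);
CREATE INDEX ON report_summary (group_key);
-- run every N minutes by the job scheduler of your choice:
REFRESH MATERIALIZED VIEW report_summary;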
Tune the report query. Use EXPLAIN (see the sketch below). Avoid procedures when you can do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade the Postgres version.
Run VACUUM.
Out of 4, only 1 will require significant changes in the application.
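For the first tip, running a slow report under EXPLAIN ANALYZE shows whether the groupKey index is actually being used; the tables and join below are purely illustrative:
EXPLAIN ANALYZE
SELECT r.id, count(*)
FROM reports r
JOIN report_items i ON i.report_id = r.id
WHERE r.group_key = 42
GROUP BY r.id;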

Cassandra Transaction with ZooKeeper - Does this work?

I am trying to implement a transaction system for Cassandra with the help of ZooKeeper. Since I don't think I have enough experience in database implementation, I would like to know whether my idea would work in principle, or whether there is any major flaw.
Here is the high level description of the steps:
identify all the rows (keys) and columns to be edited. Let the keys be [K0..Kn]
apply write locks on all the rows involved (the locks are an in-memory ZooKeeper implementation)
copy the old values to separate locations in Cassandra which are uniquely identified by keys [K'0..K'n] (a sketch of such a backup table follows the list)
store [K'0..K'n] and the mapping of them to [K0..Kn] in ZooKeeper using persistent mode
go ahead and apply the update to the data
delete the entries in ZooKeeper
unlock the rows
delete the entries for [K'0..K'n] lazily on a maintenance thread (Cassandra deletion uses timestamps, so K'0..K'n can be reused for another transaction with a newer timestamp)
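A minimal CQL sketch of what the backup locations in steps 3-4 could look like; this layout is entirely hypothetical, since the scheme above doesn't prescribe one:
CREATE TABLE txn_backup (
    txn_id       uuid,   -- one transaction
    original_key text,   -- the Kx that this backup row K'x belongs to
    column_name  text,
    old_value    blob,
    PRIMARY KEY ((txn_id), original_key, column_name)
);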
Justification:
if the transaction fails during steps 1-4, no change has been applied; I can abort the transaction and delete whatever is stored in ZooKeeper and backed up in Cassandra, if any.
if the transaction fails during step 5, the information saved in step 3 is used to roll back any changes.
if the server happens to fail/crash/be stolen by the cleaning man, then upon restart, before serving any request, I check whether any keys from step 4 are persisted in ZooKeeper; if so, I use those keys to fetch the backed-up data stored by step 3 and put that data back where it was, thus rolling back any failed transactions.
One of my concerns is what would happen if some of the servers are partitioned from the cluster. I have no experience in this area - does my scheme work at all? And does it still work if a partition happens?
You should look into Cages: http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
http://code.google.com/p/cages/
