Cassandra - Handling partitions and buckets for large data sizes

We have a requirement where an application reads a file and inserts data into a Cassandra database; the table can grow by 300+ MB in one shot during the day.
The table will have below structure
create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date)
);
The status column can take one of the values [Started, Completed, Done].
According to a couple of documents on the internet, read performance is best when a partition stays below ~100 MB, and a secondary index should only be placed on a column that is rarely modified (so I cannot index the status column). Also, if I bucket with a TWCS window of minutes, there will be lots of buckets, which may hurt performance.
So how can I make better use of partitions and/or buckets to spread inserts evenly across partitions while still being able to read records by status?
Thank you in advance.

From the discussion in the comments it looks like you are trying to use Cassandra as a queue, and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look at something like Kafka or RabbitMQ for the queuing.
It could look something like this:
Application 1 copies/generates record A;
Application 1 adds the path of A to a queue;
Application 1 upserts to Cassandra in a partition based on the file id/path (the other columns can hold info such as the date, time to copy, file hash, etc.); see the sketch after this list;
Application 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
Application 2 upserts information about the processing to Cassandra, including the status; you can also record things like the reason for a failure;
If it is a failure, you can write the path/id to another topic.
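A minimal sketch of what that Cassandra log table could look like (table and column names are illustrative, not from the question):

CREATE TABLE IF NOT EXISTS file_processing_log (
    file_path text,       -- one partition per file id/path
    event_time timestamp,
    event text,           -- e.g. 'copied', 'processing', 'completed', 'failed'
    file_hash text,
    details text,         -- e.g. reason for failure
    PRIMARY KEY (file_path, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Application 1 logs the copy; Application 2 later logs the outcome:
INSERT INTO file_processing_log (file_path, event_time, event, file_hash)
VALUES ('/data/in/orders.csv', toTimestamp(now()), 'copied', 'abc123');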
So to sum it up: don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including, where applicable, how files were processed and the results of that processing.
Depending on how you need to read and use the data in Cassandra, you could think about partitions and buckets based on things like the source of the file, the type of the file, etc. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table, and look records up by that.
Hope this helped,
Cheers!

Related

Writing latest data without using timestamp

I want to write the latest record to the db with a specific key. It would be easy if I had a timestamp with the record, but I have a sequence number instead.
Furthermore, the sequence numbers are reset to 0 after reaching a large value (2^16). The sequence number can, however, be reset at any time, even before it reaches 2^16.
I have the option of appending all records and reading the one with the largest sequence number, but that breaks after a reset (since a reset can occur at any moment).
The other option is to use lightweight transactions, but I'm not sure they guarantee the concurrency I need, and performance might be affected greatly.
How can I go about doing this? I am using Cassandra DB.
For the latest value, it's usually done by keeping a log of events and reading the first record in it. You can always generate a new timestamp (or timeuuid) when you insert. Something like:
CREATE TABLE record (
    id text,
    bucket text,
    created timeuuid,
    info blob,
    PRIMARY KEY ((id, bucket), created)
) WITH CLUSTERING ORDER BY (created DESC);
Then run SELECT * FROM record WHERE id = 'user1' AND bucket = '2017-09-10' LIMIT 1; where bucket is "today", to prevent partitions from getting too large. With timeuuids you have roughly 10k writes per ms per host before you have to worry about collisions.
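For example, each write could supply now() for the timeuuid column (a minimal sketch; the blob value is just a placeholder):

INSERT INTO record (id, bucket, created, info)
VALUES ('user1', '2017-09-10', now(), 0x01);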
If you have a linearizable consistency requirement then you will need Paxos (lightweight transactions, which will guarantee it if used appropriately) or an external locking system like ZooKeeper. In a distributed system that kind of thing is more complex, and you will never get the same throughput as with normal writes.
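A rough sketch of the lightweight-transaction route (table and values are illustrative, not from the question): keep one row per key and advance it with a compare-and-set on the sequence number.

CREATE TABLE IF NOT EXISTS latest_record (
    id text PRIMARY KEY,
    seq int,
    info blob
);

-- Applied only if no concurrent writer advanced seq first;
-- on failure ([applied] = false), re-read and retry:
UPDATE latest_record SET seq = 42, info = 0x01
WHERE id = 'user1' IF seq = 41;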

How should large SELECTs be done in Cassandra?

I'm investigating Cassandra as a possible alternative backing store for a data-intensive application, and I'm looking at ways to structure the schema and use CQL to do the kinds of queries we do today using MySQL.
The concrete problem I currently have is: I need to insert, say, 1 million rows into a table. However, if a row with the right identity already exists (i.e. it's already in the system, identified by a hash), I want to reuse its id for relational reasons. I only expect an overlap of, say, 10,000 IDs, but of course it could be all 1 million.
Suppose I have a table like this:
create table records_by_hash(hash text primary key, id bigint);
Is it enough to issue a select hash, id from records_by_hash where hash in (...) with all hashes in a multi-megabyte comma-separated list? Is this the best approach for Cassandra?
The way we do this in MySQL is like this:
create temporary table hashes(hash text);
-- file is actually JDBC OutputStream
load data infile '/dev/stdin' into table hashes; -- csv format
select id, hash from records join hashes on records.hash = hashes.hash;
Since records is indexed on hash, and the lookup data is now in MySQL (no more round trips), this is fairly quick and painless. load data is very fast, and there are only three logical round trips.
Using the in operator is usually not the best idea, because you are hitting multiple partitions (located on random nodes) within the same query. It is slow and puts a lot of work on the coordinator node. A multi-megabyte list there is not a good idea.
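A commonly suggested alternative (a sketch, not from this answer) is one single-partition query per hash, issued in parallel from the client:

-- One cheap single-partition read per hash, executed asynchronously by
-- the driver, instead of one giant IN (...) list:
SELECT hash, id FROM records_by_hash WHERE hash = ?;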
Check-before-set is rarely a good idea because it doesn't really scale, and Cassandra does not provide joins. Depending on your needs, you would have to have some sort of script that checks all of this before doing the inserts, i.e. a check-and-set workflow.
An alternative approach would be to use Spark.
The thing is, Cassandra won't mind if the hash is already there and you insert new data over it, but that is not what you want here, because you want to keep the existing references. One possible approach is to use lightweight transactions: with IF NOT EXISTS the insertion is performed only if the row does not already exist. Note that IF NOT EXISTS incurs a performance hit because it uses Paxos internally.
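For instance (values illustrative):

-- [applied] in the result tells you whether the row was written or
-- already existed (the existing row is returned in the latter case):
INSERT INTO records_by_hash (hash, id) VALUES ('abc123', 42) IF NOT EXISTS;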
In MySQL the ID is usually an AUTO_INCREMENT; there is no parallel for this in Cassandra. It's not clear to me whether you want Cassandra to create the IDs or have some other system/db create them for you.
Another thing to note is that MySQL's INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE corresponds to a plain CQL INSERT: a Cassandra INSERT will update the record if one already exists.
You may want to model the information in a different manner in Cassandra.

Data modeling with counters in Cassandra, expiring columns

The question is directed to experienced Cassandra developers.
I need to count how many times and when each user accessed some resource.
I have a data structure like this (CQL):
CREATE TABLE IF NOT EXISTS access_counter_table (
    access_number counter,
    resource_id varchar,
    user_id varchar,
    dateutc varchar,
    PRIMARY KEY (user_id, dateutc, resource_id)
);
I need to get information about how many times a user has accessed resources over the last N days. So, to get the last 7 days I make requests like this:
SELECT * FROM access_counter_table
WHERE user_id = 'user_1'
  AND dateutc > '2015-04-03'
  AND dateutc <= '2015-04-10';
And I get something like this:
user_1 : 2015-04-10 : [resource1:1, resource2:4]
user_1 : 2015-04-09 : [resource1:3]
user_1 : 2015-04-08 : [resource1:1, resource3:2]
...
So, my problem is: old data must be deleted after some time, but Cassandra does not allow setting a TTL on counter tables.
I have millions of access events per hour (and it could be billions), and after 7 days those records are useless.
How can I clear them? Or build something like a garbage collector in Cassandra? Is this even a good approach?
Maybe I need to use another data model for this? What could it be?
Thanks.
As you've found, Cassandra does not support TTLs on Counter columns. In fact, deletes on counters in Cassandra are problematic in general (once you delete a counter, you essentially cannot reuse it for a while).
If you need automatic expiration, you can model it with an int field and use external locking (such as ZooKeeper), request routing (only allow one writer to access a particular partition), or lightweight transactions to safely increment that integer field with a TTL.
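A rough sketch of the lightweight-transaction variant (the table name and the 7-day TTL are illustrative):

CREATE TABLE IF NOT EXISTS access_counter_by_int (
    user_id varchar,
    dateutc varchar,
    resource_id varchar,
    access_number int,
    PRIMARY KEY (user_id, dateutc, resource_id)
);

-- Read the current value, then conditionally write value + 1 with a TTL;
-- retry if [applied] comes back false:
UPDATE access_counter_by_int USING TTL 604800
SET access_number = 5
WHERE user_id = 'user_1' AND dateutc = '2015-04-10' AND resource_id = 'resource1'
IF access_number = 4;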
Alternatively, you can page through the table of counters and remove "old" counters manually with DELETE on a scheduled task. This is less elegant, and doesn't scale as well, but may work in some cases.
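The cleanup itself could look like this per user and expired day (values illustrative):

-- Removes all resource counters for that user and day; remember that a
-- deleted counter should not be written to again for a while:
DELETE FROM access_counter_table
WHERE user_id = 'user_1' AND dateutc = '2015-04-02';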

DB optimization to use it as a queue

We have a table called worktable with columns (key (primary key), ptime, aname, status, content).
We have a producer which inserts rows into this table, and a consumer which does an order-by on the key column and fetches the first row whose status is 'pending'. The consumer does some processing on this row:
updates status to "processing"
does some processing using content
deletes the row
We are facing contention issues when we try to run multiple consumers (probably due to the order-by, which does a full table scan).
Using advanced queues would be our next step, but before we go there we want to check the maximum throughput we can achieve with multiple consumers and producers on the table.
What optimizations can we do to get the best numbers possible?
Can we do in-memory processing, where a consumer fetches 1000 rows at a time, processes them, and then deletes them? Would that help? What are the other possibilities? Partitioning the table? Parallelization? Index-organized tables?
The possible optimizations depend a lot on the database used, but a pretty general approach is to create an index that covers all fields needed to select the correct rows (which sounds like the key and the status in this case). If the index is created correctly (some databases need the key columns in the correct order, others don't), the query should be much faster.
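For example, in generic SQL (column names taken from the question; exact syntax, and whether "key" needs quoting, vary by database):

-- A composite index on (status, key) lets a consumer locate the first
-- 'pending' row in key order without a full table scan:
CREATE INDEX idx_worktable_status_key ON worktable (status, key);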

Database design--billions of records in one table?

Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistake to create one giant table to store messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables that grow with application usage, you are effectively using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table is trivial for a modern database; at that scale your only performance concerns are good indexes and how you do joins.
Billions of records?
Assuming you constantly have 1000 active users writing 1 message per minute, that is 1000 × 60 × 24 ≈ 1.44 million messages per day, and approximately 500 million messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room could be created, each database has limits on the number of tables, so given an infinite number of chat rooms you would need an infinite number of tables, which is not going to work.
You can, on the other hand, store billions of rows of data; given the space, storage is not normally the issue. Retrieving the information within a sensible time frame is, however, and requires careful planning.
You could partition the messages by a date range, and if planned out well, you can use LUN migration to move older data onto slower storage whilst leaving more recent data on the faster storage.
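As a sketch of date-range partitioning (MySQL-style syntax; table and partition names are illustrative):

CREATE TABLE messages (
    room_id BIGINT,
    created DATETIME,
    body TEXT,
    PRIMARY KEY (room_id, created)
)
PARTITION BY RANGE (YEAR(created)) (
    PARTITION p2004 VALUES LESS THAN (2005),
    PARTITION p2005 VALUES LESS THAN (2006),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);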
Strictly speaking, your design is right: a single table. For fields with low entropy (e.g. userid) you want to link from ID tables, i.e. follow normal database normalization patterns.
You might also want to think about range-based partitioning, e.g. 'copies' of your table with a year prefix, or maybe just a 'current' table and an archive table.
Both of these approaches make your query semantics more complex (consider someone doing a multi-year search); you would have to query multiple tables.
The upside, however, is that your 'current' table remains at a roughly constant size, and archiving is more straightforward: you can just drop table 2005_Chat when you want to archive the 2005 data.
-Ace
