Data modeling with counters in Cassandra, expiring columns

The question is directed to experienced Cassandra developers.
I need to count how many times and when each user accessed some resource.
I have a data structure like this (CQL):
CREATE TABLE IF NOT EXISTS access_counter_table (
    access_number counter,
    resource_id varchar,
    user_id varchar,
    dateutc varchar,
    PRIMARY KEY (user_id, dateutc, resource_id)
);
I need to get information about how many times a user has accessed resources over the last N days. So, to get the last 7 days, I run queries like this:
SELECT * FROM access_counter_table
WHERE user_id = 'user_1'
  AND dateutc > '2015-04-03'
  AND dateutc <= '2015-04-10';
And I get something like this:
user_1 : 2015-04-10 : [resource1:1, resource2:4]
user_1 : 2015-04-09 : [resource1:3]
user_1 : 2015-04-08 : [resource1:1, resource3:2]
...
So, my problem is: old data must be deleted after some time, but Cassandra does not allow setting a TTL on counter tables.
I have millions of access events per hour (and it could be billions). After 7 days those records become useless.
How can I clear them? Or build something like a garbage collector in Cassandra? Is this a good approach?
Maybe I need to use another data model for this? What could it be?
Thanks.

As you've found, Cassandra does not support TTLs on Counter columns. In fact, deletes on counters in Cassandra are problematic in general (once you delete a counter, you essentially cannot reuse it for a while).
If you need automatic expiration, you can model it using an int field, and perhaps use external locking (such as ZooKeeper), request routing (only allowing one writer to access a particular partition), or lightweight transactions to safely increment that integer field with a TTL.
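A minimal sketch of the lightweight-transaction variant, assuming a regular int column in place of the counter (the table name access_counter_lwt and the literal values are illustrative):
CREATE TABLE IF NOT EXISTS access_counter_lwt (
    user_id varchar,
    dateutc varchar,
    resource_id varchar,
    access_number int,
    PRIMARY KEY (user_id, dateutc, resource_id)
);

-- read the current value first, then conditionally write value + 1;
-- retry if the IF clause fails. The TTL expires the row after 7 days.
UPDATE access_counter_lwt USING TTL 604800
SET access_number = 5
WHERE user_id = 'user_1' AND dateutc = '2015-04-10' AND resource_id = 'resource1'
IF access_number = 4;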
Alternatively, you can page through the table of counters and remove "old" counters manually with DELETE on a scheduled task. This is less elegant, and doesn't scale as well, but may work in some cases.
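If you go the scheduled-cleanup route, a per-day delete could look like this (values are illustrative; equality on the dateutc clustering column removes all resources for that day in one range tombstone, and the counter-reuse caveat above still applies):
DELETE FROM access_counter_table
WHERE user_id = 'user_1' AND dateutc = '2015-04-03';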

Related

Cassandra - Handling partition and bucket for large data size

We have a requirement where an application reads a file and inserts data into a Cassandra database; however, the table can grow by 300+ MB in one shot during the day.
The table will have below structure
create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date)
);
The status column can have the values [Started, Completed, Done].
As per a couple of documents on the internet, read performance is best when a partition is < 100 MB, and an index should only be used on a column that is rarely modified (so I cannot index the status column). Also, if I use buckets with TWCS at minute granularity, there will be lots of buckets, which may impact performance.
So, how can I better make use of partitions and/or buckets to insert evenly across partitions and to read records with the appropriate status?
Thank you in advance.
From the discussion in the comments it looks like you are trying to use Cassandra as a queue and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look for something like Kafka or RabbitMQ for the queuing.
It could look something like this:
Application 1 copies/generates record A;
Application 1 adds the path of A to a queue;
Application 1 upserts to Cassandra in a partition based on the file id/path (the other columns can hold info such as date, time to copy, file hash, etc.);
Application 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
Application 2 upserts to Cassandra information about the processing, including the status; you can also record things like the reason for a failure;
If it is a failure, you can write the path/id to another topic.
So, to sum it up: don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including things like how each file was processed and the result of the processing (if applicable).
Depending on how you would further need to read and use the data in Cassandra, you could partition and bucket based on things like the source of the file, the type of file, etc. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table, and then look up a file's processing info by that key.
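As a rough sketch of what such a persistence table could look like, with all names being illustrative:
CREATE TABLE IF NOT EXISTS file_processing_log (
    file_id uuid,
    file_path text,
    status text,            -- e.g. 'Started', 'Completed', 'Failed'
    failure_reason text,
    processed_at timestamp,
    PRIMARY KEY (file_id)
);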
Hope this helped,
Cheers!

How to handle a content ranking system?

I know the question is worded very badly, so I'll give an example.
Let's say we have a filesystem that stores hundreds of files, and a database with paths to these files.
Each stored path of a file in the database is ranked by a number of likes. The number of likes for a file can go up and down, and it does so frequently.
Now I have a client who would like to get the top 10 files by rank on the first page, the next 10 on the second page, and so on.
How would I handle the frequent changes to the rankings of these files if we want to display them in real time on the client?
Doing a request to the database each time, getting all of the files, and then sorting them by likes feels wrong, since the database can get quite large.
I also thought about just having an in-memory cache on the server that stores maybe the first X ranked files, or even all of them. Would that be better?
Maybe then I could use sockets, and for every change in a file's likes I could just inform the clients about it?
I really don't know how to approach this problem, or even what the correct way of doing these kinds of things is.
Any help would be much appreciated.
Thanks!
I think the simplest solution here is to implement a dedicated counter table. The table would look like this:
CREATE TABLE counter_table (
    file_path varchar(255) NOT NULL,
    like_count int DEFAULT 0,
    PRIMARY KEY (file_path)
) ENGINE=InnoDB;
Note that I have specified the ENGINE as InnoDB: unlike MyISAM, which has table-level locking, InnoDB implements row-level locking. This means that concurrent queries updating different rows in the same table will not block each other.
Now you can just update the value per file with a query like this.
UPDATE counter_table SET like_count = like_count + 1 WHERE file_path = 'XYZ';
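If the row may not exist yet, an upsert variant works, and the paged read from the question is a simple ORDER BY. A sketch, with 'XYZ' and the page size being illustrative:
-- create the row on the first like, increment it afterwards
INSERT INTO counter_table (file_path, like_count)
VALUES ('XYZ', 1)
ON DUPLICATE KEY UPDATE like_count = like_count + 1;

-- first page of the ranking; bump OFFSET by 10 for each later page
SELECT file_path, like_count
FROM counter_table
ORDER BY like_count DESC
LIMIT 10 OFFSET 0;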
This solution should serve moderate to high traffic. As you approach very high traffic, you may need to evaluate a stream-aggregation-based solution such as Apache Spark Streaming.

How to get max value from cassandra table with where clause

I have a little design problem: I have the following query,
SELECT MAX(idt) FROM table WHERE idt < 2018
but I can't figure out how to design the table to support it.
idt must be a clustering key to be able to do greater-than or less-than operations as well as the MAX aggregation, but I don't know what I should use as the partition key (I don't want to use ALLOW FILTERING).
The only solution I've found is to use a constant value as the partition key, but I know that's considered bad design.
Any help?
Thank you,
You will need to partition your data somehow. If you do not, it is as you say: either read everything from the whole cluster (ALLOW FILTERING) or put everything in a single partition (constant key). Not knowing anything about your data, design, or goals, a common setup is to partition by date, like:
SELECT id FROM table WHERE bucket = '2018' AND id < 100 limit 1;
Then your key would look like ((bucket), id), ordering id DESC so the largest is at the head of the partition. In this case buckets are by year, so you end up making one query per year that you're looking for. If idt is not unique, you might need to do something like:
((uuid), idt) or ((bucket), uuid, idt), sorting by idt DESC (once again, there are issues if it is not unique for that record). Then you can do things like
SELECT max(idt) FROM table GROUP BY bucket
although it is still better to do
SELECT max(idt) FROM table WHERE bucket = '2018' GROUP BY bucket
which gives you the max per bucket, so you would have to page through the buckets and compute the global max yourself. It is better for the cluster, though, as it naturally throttles a little versus a single query slamming the whole cluster. It might also be a good idea to limit the fetch size on that query to something like 10 or 100 versus the default 5000, so the result set pages more slowly (preventing too much work on the coordinator).
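Putting that together, a rough sketch of the bucketed table and the per-year read (table and column names are illustrative):
CREATE TABLE IF NOT EXISTS ids_by_year (
    bucket text,    -- e.g. '2018'
    idt int,
    PRIMARY KEY ((bucket), idt)
) WITH CLUSTERING ORDER BY (idt DESC);

-- with idt clustered DESC, the first matching row is the max for the bucket
SELECT idt FROM ids_by_year WHERE bucket = '2018' AND idt < 2018 LIMIT 1;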
To have all of this work done somewhere else, you might want to consider Spark, as it gives you much richer queries and does them as efficiently as it can (which might not be efficient, but it will try).

Writing latest data without using timestamp

I want to write the latest record to the DB with a specific key. It would be easy if I had a timestamp with the record, but I have a sequence number instead.
Furthermore, the sequence numbers are reset to 0 after reaching a large value (2^16). A sequence number can, however, be reset at any time, even before it reaches 2^16.
I have the option of appending all records and reading the one with the largest sequence number, but that causes problems after a reset (since a reset can occur at any moment).
The other option is to use lightweight transactions, but I'm not sure they will guarantee concurrency, and performance might be affected greatly.
How can I go about doing this? I am using Cassandra DB.
For the latest value, this is usually done by keeping a log of events and reading the first record in it. You can always generate a new timestamp (or timeuuid) when you insert. Something like:
CREATE TABLE record (
    id text,
    bucket text,
    created timeuuid,
    info blob,
    PRIMARY KEY ((id, bucket), created)
) WITH CLUSTERING ORDER BY (created DESC);
Then SELECT * FROM record WHERE id = 'user1' AND bucket = '2017-09-10' LIMIT 1; where bucket is "today", to prevent partitions from getting too large. With timeuuid you have 10k writes per millisecond per host before you have to worry about collisions.
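For completeness, a sketch of the matching write; now() generates the timeuuid server-side, and the id, bucket, and payload values are illustrative:
INSERT INTO record (id, bucket, created, info)
VALUES ('user1', '2017-09-10', now(), textAsBlob('payload'));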
If you have a linearizable consistency requirement, you will need to use Paxos (lightweight transactions, which will guarantee it if used appropriately) or an external locking system like ZooKeeper. In a distributed system that kind of thing is more complex, and you will never get the same throughput as with normal writes.

Solution to handling data that frequently changes

I'm currently trying to figure out a solution that best optimizes for data that is going to change frequently. The server is running IIS/SQL Server, and it is an ASP.NET Web API application. My table structure is something like the following:
User Table:
UserID PK
Status Table:
StatusID PK,
Title varchar
UserStatus Table:
UserID PK (CLUSTERED),
StatusID FK (NON-CLUSTERED),
Date DateTimeOffset (POSSIBLY INDEXED) - This would be used as an expiration. Old records become irrelevant.
There will be roughly 5000+ records in the User table. The Status table will have roughly 500 records. The UserStatus table would see frequent changes (every 5-30 seconds) to the StatusID and Date fields by anywhere from 0 to 1000 users at any given time. This UserStatus table will also have frequent SELECT queries performed against it, filtering out records with old/irrelevant dates.
I have considered populating the UserStatus table with a record for each user and only performing updates. This would mean there would always be the expected record present, and it would limit the checks for existence. My concern is performance and all of the fragmenting of the indexes. I would then query against the table for records with dates that fall within several minutes of the current time.
I have considered only inserting relevant records into the UserStatus table, updating when they exist for a user, and running a task that cleans old/irrelevant data out. This method would keep the number of records down, but I would have to check for the existence of records before performing a task, and the indexes may inhibit performance.
Finally, I have considered a MemoryCache or something of the like. I do not know much about caching in a Web API, but from what I have read about it, I quickly decided against this because of potential concurrency issues when iterating over the cache.
Does anyone have a recommendation for a scenario like this? Is there another methodology I am not considering?
Given the number of records you are talking about, I would use the T-SQL MERGE statement, which updates existing records and adds new ones in one efficient statement.
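A sketch of what that could look like against the UserStatus table described above (the @-parameters are illustrative):
MERGE dbo.UserStatus AS target
USING (VALUES (@UserID, @StatusID, SYSDATETIMEOFFSET()))
    AS source (UserID, StatusID, [Date])
ON target.UserID = source.UserID
WHEN MATCHED THEN
    UPDATE SET StatusID = source.StatusID, [Date] = source.[Date]
WHEN NOT MATCHED THEN
    INSERT (UserID, StatusID, [Date])
    VALUES (source.UserID, source.StatusID, source.[Date]);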
Given the design you mentioned, you should be able to run a periodic maintenance script that fixes any fragmentation issues.
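For example, the script could reorganize the table's indexes (a sketch; whether to REORGANIZE or REBUILD depends on the measured fragmentation):
-- lightweight, online defragmentation of all indexes on the table
ALTER INDEX ALL ON dbo.UserStatus REORGANIZE;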
The solution can be scaled. If the records got to the point where some slowdown was occurring, I would consider SSDs, where fragmentation is not an issue.
If the disadvantages of SSDs make that undesirable, you can look into In-Memory OLTP.
