I want to write the latest record for a specific key to the database. This would be easy if each record carried a timestamp, but instead I only have a per-record sequence number.
Furthermore, the sequence numbers wrap back to 0 after reaching a large value (2^16). A sequence number can, however, be reset at any time, even before it reaches 2^16.
One option is to append all records and read back the one with the largest sequence number, but that breaks after a reset (since a reset can occur at any moment).
The other option is to use lightweight transactions, but I'm not sure whether they guarantee correctness under concurrent writes, and performance might be affected greatly.
How can I go about doing this? I am using Cassandra.
For the latest value, this is usually done by keeping a log of events and reading the first record in it. You can always generate a new timestamp (or timeuuid) when you insert. Something like:
CREATE TABLE record (
    id text,
    bucket text,
    created timeuuid,
    info blob,
    PRIMARY KEY ((id, bucket), created)
) WITH CLUSTERING ORDER BY (created DESC);
Then run SELECT * FROM record WHERE id = 'user1' AND bucket = '2017-09-10' LIMIT 1; where bucket is "today", to keep partitions from getting too large. With a timeuuid you have about 10k writes per millisecond per host before you have to worry about collisions.
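For completeness, a minimal sketch of the matching insert (the payload text is just a placeholder): now() generates a timeuuid server-side and textAsBlob converts the text to a blob:

INSERT INTO record (id, bucket, created, info)
VALUES ('user1', '2017-09-10', now(), textAsBlob('latest payload'));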
If you have a linearizable consistency requirement then you will need to use Paxos (lightweight transactions, which will guarantee it if used appropriately) or an external locking system like ZooKeeper. In a distributed system that kind of coordination is more complex, and you will never get the same throughput as with normal writes.
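If you do go the LWT route, a rough sketch might look like the following; the table latest_record, its columns, and the literal values are assumptions for illustration, not something from the question. The IF condition turns the write into a Paxos round, so only the update carrying a higher sequence number wins:

CREATE TABLE IF NOT EXISTS latest_record (
    id text PRIMARY KEY,
    seq int,
    info blob
);

-- conditional update: only overwrite if the stored sequence number is lower;
-- the very first write for a key would need INSERT ... IF NOT EXISTS instead
UPDATE latest_record
SET seq = 42, info = textAsBlob('payload')
WHERE id = 'user1'
IF seq < 42;

Note that a sequence-number reset would still need extra handling (for example comparing a generation or epoch column as well); the LWT alone does not solve that.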
We have a requirement where an application reads a file and inserts the data into a Cassandra database; the table can grow by 300+ MB in one shot during the day.
The table will have the structure below:
create table if not exists orders (
id uuid,
record text,
status varchar,
create_date timestamp,
modified_date timestamp,
primary key (status, create_date));
The 'status' column can have the values [Started, Completed, Done].
According to a couple of documents on the internet, read performance is best if a partition is < 100 MB, and an index should be used on a column that is rarely modified (so I cannot use the 'status' column as an index). Also, if I use minute-level buckets with TWCS, there will be a lot of buckets, which may have an impact.
So, how can I make better use of partitions and/or buckets to insert evenly across partitions and read records with the appropriate status?
Thank you in advance.
From the discussion in the comments it looks like you are trying to use Cassandra as a queue and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look for something like Kafka or RabbitMQ for the queuing.
It could look something like this:
Application 1 copies/generates record A;
Application 1 adds the path of A to a queue;
Application 1 upserts to Cassandra into a partition based on the file id/path (the other columns can be info such as date, time to copy, file hash, etc.; see the sketch after this list);
Application 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
Application 2 upserts to Cassandra information about the processing, including the status. You can also store things like the reason for the failure;
If it is a failure, you can write the path/id to another topic.
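As a hypothetical sketch of that per-file log table (all names and columns here are made up for illustration, not taken from the question):

CREATE TABLE IF NOT EXISTS file_processing_log (
    file_id uuid,
    event_time timestamp,
    status text,            -- e.g. Started, Completed, Done
    file_hash text,
    failure_reason text,
    PRIMARY KEY (file_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- each application upserts its step as a row; in practice file_id would be
-- the id derived from the file path, not a fresh uuid() per insert
INSERT INTO file_processing_log (file_id, event_time, status, file_hash)
VALUES (uuid(), toTimestamp(now()), 'Started', 'abc123');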
So, to sum it up, don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, including, where applicable, how files were processed and the results of that processing.
Depending on how you will need to read and use the data in Cassandra, you could think about partitioning and bucketing based on things like the source of the file, the type of file, etc. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table, and then look up information about a file by that.
Hope this helped,
Cheers!
I am trying to design a timeseries service based on Cassandra that will keep track of some log information.
The database will see a relatively high volume of writes (expecting ~500 million inserts per day) and less frequent but large-volume reads (think one day of data or one month of data).
The simplified data model of one log entry looks like this (in reality it has 50 or so columns):
log_datetime date
log_some_field text
log_some_other_field text
Most read queries will revolve around selecting data from a certain date range, always ordered descending by date (e.g. SELECT * FROM logs WHERE log_datetime >= '2012-01-01' AND log_datetime <= '2012-02-01' ORDER BY log_datetime DESC). This will normally take a considerable amount of time, so I'd like to optimize for it as much as possible.
Since ordering and filtering by date are the most important features, as long as writes are not too terrible, the first idea was to define something like this (where log_day is the day of the year):
CREATE TABLE logs (
    log_day tinyint,
    log_datetime timeuuid,
    log_some_field text,
    log_some_other_field text,
    PRIMARY KEY (log_day, log_datetime)
) WITH CLUSTERING ORDER BY (log_datetime DESC);
It is my understanding that this would make retrieval as good as it gets, since the data is ordered and only a single partition needs to be read to retrieve one day (I can handle the cases where several days are selected on the client side).
However, this would send all writes for a day to a single server, which would considerably affect write performance. The other option is to choose some random set of values to use as partition keys and distribute writes across them in a round-robin manner from the client, which would make writes faster and scalable but would lead to worse read performance, especially if we have to re-sort the data. Most examples I've seen have natural partition keys in the dataset, like a user_id or a post_id, which is not my case.
Did anybody here have a similar use case? If so, what tradeoffs did you make to get decent performance? Do you know of any databases that would perform better for such use cases?
As you note, using day as the partition key means writes for an entire day go to a single primary node. Data is replicated in Cassandra based upon the replication factor, typically 3, so three nodes would be written to on any given day.
If the data volume were low, this might be acceptable. Generally it is not, and one would use some sort of time bucket, such as 5- or 10-minute intervals computed in the application:
CREATE TABLE logs (
    log_day tinyint,
    timebucket tinyint,
    log_datetime timeuuid,
    log_some_field text,
    log_some_other_field text,
    PRIMARY KEY ((log_day, timebucket), log_datetime)
) WITH CLUSTERING ORDER BY (log_datetime DESC);
The choice of an appropriate time interval for the bucket depends on your expected data volume. With 500M writes per day, that is about 6K per second. Your time buckets could wrap every hour, so you have only 6 of them (using 10-minute intervals), or span an entire day, giving 144 unique buckets. When reading results, your application will have to read all buckets for a given day and merge (but not sort) the results.
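To illustrate, reading one day then becomes one query per (day, bucket) pair, with the client concatenating the already-descending results from each bucket; the literal values below are placeholders:

-- repeat for each timebucket value used during that day;
-- rows come back ordered by log_datetime DESC thanks to the clustering order
SELECT * FROM logs
WHERE log_day = 42 AND timebucket = 3;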
In a syslog-type application, using severity plus day in the partition key could help distribute the load across the cluster with a natural key. It would still be lumpy, because the count of info messages is a lot greater than that of warning, error, or fatal messages.
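For instance, a syslog-flavored variant (the severity values and column set are assumptions) could fold severity into the partition key like this:

CREATE TABLE syslog (
    log_day tinyint,
    severity text,          -- e.g. 'INFO', 'WARN', 'ERROR', 'FATAL'
    log_datetime timeuuid,
    message text,
    PRIMARY KEY ((log_day, severity), log_datetime)
) WITH CLUSTERING ORDER BY (log_datetime DESC);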
In my database we use composite primary keys generated by user-ID and current date/time:
CREATE TABLE SAMPLE (
Id bigint NOT NULL, -- generated from ticks in UTC timezone
UserId int NOT NULL,
Created timestamp without time zone NOT NULL,
...
PRIMARY KEY (UserId, Id)
)
As Id we use DateTime.UtcNow.Ticks from .NET Framework.
Now I would like to use millisecond Unix time instead, because it will be easier to use for people who don't know the .NET Framework.
Are there any potential problems with using Unix time in a composite primary key? I heard that it does not account for leap seconds, but I'm not sure whether that can cause any real problems if I use it for my database IDs.
Please note that I don't use the generated IDs to derive the creation date/time; we always have a separate Created field for that. We also never generate more than one record per second, so duplicates are not a problem.
The biggest concern I'd have is that you may have multiple rows created within the same timestamp, creating a conflict between the first row and all subsequent rows.
Unix time is typically in whole seconds, and even if you increase the precision to milliseconds, you could still end up using the same temporarily cached value for multiple records, depending on the implementation details of how the timestamp is read from the system clock.
Even with DateTime.UtcNow.Ticks, under certain circumstances, multiple calls in a tight loop might return the same value. The same goes for getutcdate and similar SQL commands.
If you need an integer unique identifier, it is better to use an auto-incrementing integer, which is a feature built into most databases.
As long as they're unique (no more than one per second per combination of the other fields in the composite key), MySQL will accept timestamps as keys just fine.
However, I'm worried about your claim:
We also never generate more than one record per second, so duplicates are not a problem.
I've heard this so many times:
"We'll never have parallel requests."
"We'll never get this many requests per second," etc.
Just a warning: this is tempting fate big time, and someone will be cursing you later.
Based on your comment, you've added detection and backoff/retry for conflicts (key denials). Keep an eye out if you scale out horizontally, because that is where you may still see issues.
If your servers have slightly skewed clocks, for example, you could get frequent collisions even with millisecond timestamps. Milliseconds are not as granular as you think, especially when you scale out (I had this happen with load-balanced servers when I tried to create our own UUID function based on timestamps and some other crappy heuristics).
I'd recommend solving this now rather than leaving it to chance, by using something like an auto-increment column in the DB, a UUID, or at least an additional random-number field.
This question is directed at experienced Cassandra developers.
I need to count how many times, and when, each user accessed some resource.
I have a data structure like this (CQL):
CREATE TABLE IF NOT EXISTS access_counter_table (
access_number counter,
resource_id varchar,
user_id varchar,
dateutc varchar,
PRIMARY KEY (user_id, dateutc, resource_id)
);
I need to get information about how many times a user has accessed resources over the last N days. So, to get the last 7 days I make requests like this:
SELECT * FROM access_counter_table
WHERE
user_id = 'user_1'
AND dateutc > '2015-04-03'
AND dateutc <= '2015-04-10';
And I get something like this:
user_1 : 2015-04-10 : [resource1:1, resource2:4]
user_1 : 2015-04-09 : [resource1:3]
user_1 : 2015-04-08 : [resource1:1, resource3:2]
...
So, my problem is that old data must be deleted after some time, but Cassandra does not allow setting a TTL on counter tables.
I have millions of access events per hour (and it could be billions). After 7 days those records become useless.
How can I clear them out? Or build something like a garbage collector in Cassandra? Is this a good approach?
Maybe I need a different data model for this? What could it be?
Thanks.
As you've found, Cassandra does not support TTLs on Counter columns. In fact, deletes on counters in Cassandra are problematic in general (once you delete a counter, you essentially cannot reuse it for a while).
If you need automatic expiration, you can model it using an int field, and perhaps use external locking (such as zookeeper), request routing (only allow one writer to access a particular partition), or Lightweight transactions to safely increment that integer field with a TTL.
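A rough sketch of that int-based variant, with made-up table and column names: the conditional update makes the read-modify-write safe, and USING TTL expires the row after 7 days (604800 seconds):

CREATE TABLE IF NOT EXISTS access_count_by_day (
    user_id varchar,
    dateutc varchar,
    resource_id varchar,
    access_number int,
    PRIMARY KEY (user_id, dateutc, resource_id)
);

-- read the current value first, then conditionally write value + 1;
-- retry if the IF condition fails because another writer got there first
UPDATE access_count_by_day USING TTL 604800
SET access_number = 5
WHERE user_id = 'user_1' AND dateutc = '2015-04-10' AND resource_id = 'resource1'
IF access_number = 4;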
Alternatively, you can page through the table of counters and remove "old" counters manually with DELETE on a scheduled task. This is less elegant, and doesn't scale as well, but may work in some cases.
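For the original counter table, the scheduled cleanup could be as simple as a per-user, per-day delete; keep in mind that deleted counter cells should not be incremented again afterwards:

DELETE FROM access_counter_table
WHERE user_id = 'user_1' AND dateutc = '2015-04-03';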
I decided to use HBase in a project to store user activities in a social network. Despite the fact that HBase has a simple way to express data (column-oriented), I'm facing some difficulty deciding how to represent the data.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I thought of basically two approaches with an Activity HBase table:
The key could be the user reference + the timestamp of activity creation, and the value all the activity metadata (most of the time of fixed size).
The key is the user reference, and each activity is stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact on the way I access the data for these two approaches?
In general you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (by default 256 MB), so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all user activity in one row will allow you to get their full history with one get. You will be retrieving the full row, though, which could cause some slowdown for a long history (tens of seconds for > 100 MB rows).
Going with a tall table and an inverted timestamp in the key would allow you to get a user's recent activity very quickly (start a scan with key = user id).
Using a timestamp as the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always go to the most recent region in the system, causing hot-spotting).
You might also want to consider putting more information (such as the activity type) in the key so that you can pick up all activity of a particular type more easily.
Another example to look at is OpenTSDB.