I have a model which looks like:
StateChange:
row_id
group_name
timestamp
user_id
I aim to query as follows:
Query 1 = Find all state changes with row_id = X ORDER BY Timestamp DESC
Query 2 = Find all state changes with row_id = X and group_name = Y ORDER BY Timestamp DESC
Using my limited CQL knowledge, the only way I found to do this was to create two query tables, one for each query mentioned above.
For query 1:
CREATE TABLE state_change (
row_id int,
user_id int,
group_name text,
timestamp timestamp,
PRIMARY KEY (row_id, timestamp)
)
For query 2:
CREATE TABLE state_change_by_group_name (
row_id int,
user_id int,
group_name text,
timestamp timestamp,
PRIMARY KEY ((row_id, group_name), timestamp)
)
This does solve the problem but I have duplicated data in Cassandra now.
Note: Creating a group_name secondary index on the table works, but I can no longer ORDER BY timestamp, because the query then goes through the secondary index.
Looking for a solution which requires only one table.
The solution you're looking for does not exist. Two different queries require two different tables (or at least a secondary index, which creates a table under the hood). Denormalization is the norm in Cassandra, so you should not think of data duplication as an anti-pattern -- it is in fact the suggested pattern.
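As an illustration of the write path with two denormalized tables, both inserts can go into a single logged batch so they succeed or fail together (a minimal sketch using the table definitions from the question; the values are placeholders):
BEGIN BATCH
INSERT INTO state_change (row_id, user_id, group_name, timestamp)
VALUES (1, 42, 'admins', '2015-06-01 12:00:00');
INSERT INTO state_change_by_group_name (row_id, user_id, group_name, timestamp)
VALUES (1, 42, 'admins', '2015-06-01 12:00:00');
APPLY BATCH;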
Carlo is correct in that your multiple table solution is the proper approach here.
This does solve the problem but I have duplicated data in Cassandra now.
...
Looking for a solution which requires only one table.
Planet Cassandra recently posted an article on this topic: Escaping From Disco-Era Data Modeling
(Full disclosure: I am the author)
But two of the last paragraphs really address your point (especially, the last sentence):
That is a very 1970's way of thinking. Relational database theory
originated at a time when disk space was expensive. In 1975, some
vendors were selling disk space at a staggering eleven thousand
dollars per megabyte (depending on the vendor and model). Even in
1980, if you wanted to buy a gigabyte’s worth of storage space, you
could still expect to spend around a million dollars. Today (2014),
you can buy a terabyte drive for sixty bucks. Disk space is cheap;
operation time is the expensive part. And overuse of secondary
indexes will increase your operation time.
Therefore, in Cassandra, you should take a query-based modeling
approach. Essentially (Patel, 2014), model your column families
according to how it makes sense to query your data. This is a
departure from relational data modeling, where tables are built
according to how it makes sense to store the data. Often, query-based
modeling results in storage of redundant data (and sometimes data that
is not dependent on its primary row key)…and that’s ok.
Related
My program generates a large amount of time-series data into the following table:
CREATE TABLE AccountData
(
PartitionKey text,
RowKey text,
AccountId uuid,
UnitId uuid,
ContractId uuid,
Id uuid,
LocationId uuid,
ValuesJson text,
PRIMARY KEY (PartitionKey, RowKey)
)
WITH CLUSTERING ORDER BY (RowKey ASC)
The PartitionKey is a dictionary value (one of 10) and the RowKey is DateTime converted to long.
Now, due to the crazy amount of data being generated by the program, every ContractId has a different retention policy in the code. The code deletes old data based on the retention period for the specific ContractId.
I am now running into problems where a SELECT statement picks up too many tombstones and I get an error.
What Table Compaction strategy should I use to solve this Tombstone problem?
PartitionKey is a dictionary value (one of 10)
I think this is likely your problem. Basically, all of the data in the cluster is ending up on 10 partitions. Those are going to get extremely large as time progresses. In general, you want to keep your partitions between 1MB-10MB in size. The lower the better.
I would recommend splitting the partition up. If it's time related, take a time unit which makes the most sense to your query pattern. For example, if most of the queries are month-based, perhaps something like this might work:
PRIMARY KEY ((month,PartitionKey),RowKey)
That will create a partition for each combination of month and the current PartitionKey.
Likewise, most time series use cases tend to query the most recent data more often. To that end, it usually makes sense to sort data in the partitions by time, in descending order. That is, of course, if RowKey is indeed a date/time value.
WITH CLUSTERING ORDER BY (RowKey DESC)
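Putting both suggestions together, the revised table could look something like this (a sketch only; the Month bucket column, its text type, and the new table name are assumptions made for illustration):
CREATE TABLE AccountData_v2
(
PartitionKey text,
Month text,          -- time bucket, e.g. '2023-01' (assumed column)
RowKey text,
AccountId uuid,
UnitId uuid,
ContractId uuid,
Id uuid,
LocationId uuid,
ValuesJson text,
PRIMARY KEY ((Month, PartitionKey), RowKey)
)
WITH CLUSTERING ORDER BY (RowKey DESC)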
Also, a nice little side-effect of this model is that any old data which is tombstoned is now at the "bottom" of the partition. So, depending on the delete patterns, tombstones will still exist. But if the data is clustered in descending order, the tombstones are rarely (if ever) queried.
What Table Compaction strategy should I use to solve this Tombstone problem?
So I do not believe that simply changing the compaction strategy will be the silver bullet to solve this problem. That being said, I suggest looking into the TimeWindowCompactionStrategy. That compaction strategy stores its SSTable files by a designated time period (window). This prevents files full of old, obsoleted, or tombstoned data from being queried.
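For reference, switching an existing table to TWCS is a single ALTER (a sketch; the window unit and size are placeholders that would need to be tuned to the actual retention periods):
ALTER TABLE AccountData
WITH compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': '7'
};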
I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamps. The client will then request the data whose version numbers don't match. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate over the available regions to read data per region. This is my SELECT:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if I should create a non-clustered index on my RegionId column and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is a rowversion/timestamp column and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index, since it will need to be updated on every insert/update/delete?
Since this will result in n index scans, rather than n index seeks, it might be better to read all the data once, and then group by regionId and fill in empty lists of rows where a regionId doesn't have any data.
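A sketch of that single read (assuming the table and column names used above), with the per-region grouping then done client-side:
SELECT RegionId, DatabaseId, VersionTimestamp, OperationId
FROM dbo.TableX
ORDER BY RegionId;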
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I have not yet looked at including one-to-many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to better use them. Since I am going to read all the data from the table in any case, it is probably cheaper to load it all at once. However, reading it with the per-region query above makes my code a lot cleaner, at least for this simple no-relationship example.
Edit:
Alternative 2
Another option that came to mind, is creating a covering index on RegionId, and include my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
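A sketch of that two-step approach (the id list and values in the second query are placeholders for whatever step 1 returns, and an integer id type is assumed for illustration):
-- step 1: id list only, fully covered by an index on RegionId
SELECT DatabaseId
FROM dbo.TableX
WHERE RegionId = 1;
-- step 2: fetch the remaining columns for those ids
SELECT DatabaseId, VersionTimestamp, OperationId
FROM dbo.TableX
WHERE DatabaseId IN (1, 2, 3);  -- placeholder id list returned by step 1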
For the current scenario there are only thousands of rows in the table at most, not millions. Network traffic for the two (x n) queries would most likely outweigh the benefit of the indexes, making this premature optimization.
I have one table having the columns like below:
symbol
region
country
location
date
count
I created this table like below:
CREATE TABLE IF NOT EXISTS INFO (symbol varchar, region varchar, country varchar, location varchar, date date, count varint, PRIMARY KEY (symbol, date));
Now I have a set of queries that this table needs to support:
select * from info where symbol='AAA';
select * from info where date='2017-01-01';
select * from info where count < 5;
select * from info where country='XXX';
select * from info where location='XYZ';
select * from info where region='PQR';
These queries are not working.
In simple words, I want table structure which supports all or any number of columns in the where clause.
Is it possible to do this in Cassandra?
It looks to me like you need to learn some Cassandra data modelling. I recommend you go to https://academy.datastax.com/courses and watch some courses (more specifically DS210 and DS220), they're free after a simple registration. This is in my opinion the best way to learn Cassandra. I know they're long but they're incredibly useful.
To answer your question: you always have to specify the partition key (symbol in your case) in your query, and this is why. When you insert data, Cassandra hashes the partition key and stores the data on the node that is responsible for that hash (its token range). So if you have 1000 nodes in your cluster and you run one of the SELECT queries you specified, how will Cassandra know which node has the data? It is possible to search all nodes for the data you want by using ALLOW FILTERING, but as you can imagine this is terrible for performance. Here is a reference for better understanding: https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key
The way to solve this is by creating multiple tables with the same data, but different partition key. Yes, this will make a lot of redundant data but is that really that bad?
The first cost of this will be that you need to buy more disk space. But disk space is cheap so it's not really that big of a problem. CPU is more expensive.
The second cost is that you have to do multiple writes to keep your tables consistent. But compared to SQL databases, Cassandra is extremely fast at writing data. Reads are more expensive, but that won't matter in your case since you will only read the data once anyway.
So how should you do this practically?
In your case you will have to create a new table for each new partition key that you need. That is, create four new tables with date, country, location, and region as the partition key, respectively.
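For example, the country table could look something like this (a sketch; the table name and the choice of clustering columns are assumptions made for illustration):
CREATE TABLE IF NOT EXISTS info_by_country (symbol varchar, region varchar, country varchar, location varchar, date date, count varint, PRIMARY KEY (country, date, symbol));
SELECT * FROM info_by_country WHERE country='XXX';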
For the SELECT statement with count < 5 it gets a little more complicated. As I stated before, Cassandra wants to know exactly which partition your data is located in, so making count a partition key won't really help. You still need to specify the partition key in your query as well, like this:
select * from info where symbol='AAA' AND count < 5;
However, since count isn't a clustering key, this won't work either. A clustering key is used to sort your data inside a partition. You can have as many clustering keys as you want in your table. Clustering keys are part of the primary key: the first part of the primary key is ALWAYS the partition key, and everything that comes after it is a clustering key. So the table would need count added as a clustering key:
CREATE TABLE IF NOT EXISTS INFO (symbol varchar, region varchar, country varchar, location varchar, date date, count varint, PRIMARY KEY (symbol, date, count));
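With that layout, a range restriction on count must be preceded by equality restrictions on the clustering columns before it, so the query would look something like this (a sketch; the literal values are placeholders):
SELECT * FROM info WHERE symbol='AAA' AND date='2017-01-01' AND count < 5;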
I know this is all confusing for a beginner, but just remember that Cassandra is not a SQL database. Try watching some of the videos I linked and read about the different concepts in the DataStax documentation (it's still a lot better than the official Cassandra documentation).
Here is a glossary for some of the terms I just used: https://docs.datastax.com/en/glossary/doc/glossary/glossaryTOC.html
DB design 1:
There is 1 table
CREATE TABLE t (id int PRIMARY KEY, name varchar(20), description varchar(10000));
DB design 2:
There are 2 tables
CREATE TABLE Table1 (id int PRIMARY KEY, name varchar(20));
CREATE TABLE Table2 (id int PRIMARY KEY, description varchar(10000));
Note: each id must have a description associated with it. We don't query the description so often like name.
In design 1, a single simple query can get both name and description with no join needed, but what if we have 1 million records? Will it be slow?
In design 2, we need a join, so the database has to do some searching and id matching --> this could be slow; but since we don't query description often, it will only be slow some of the time, not all the time.
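For instance, the lookup in design 2 would need something like this (a sketch; the literal id is a placeholder):
SELECT t1.name, t2.description
FROM Table1 t1
JOIN Table2 t2 ON t2.id = t1.id
WHERE t1.id = 42;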
So what is the better design in this scenario?
That's called vertical partitioning or "row splitting", and it is no silver bullet (nothing is). You are not getting "better performance", you are just getting "different performance". Whether one set of performance characteristics is better than the other is a matter of engineering trade-off and varies from one case to another.
In your case, 1 million rows will fit comfortably into DBMS cache on today's hardware, producing excellent performance. So unless some of the other reasons apply, keep it simple, in a single table.
And if it's 1 billion rows (or 1 trillion, or whatever number is too large for the memory standards of the day), keep in mind that if you have indexed your data correctly, performance will remain excellent long after the table grows bigger than the cache.
Only in the most extreme cases will you need to vertically partition the table for performance reasons -- in which case you'll have to measure in your own environment, with your own access patterns, and determine whether it brings any performance benefit at all, and whether that benefit is large enough to make up for the cost of the extra JOINs.
That's over-optimization for 1 million records, in my opinion; it's really not that much. You could test the actual performance by generating about a million rows of dummy data in a test database and querying it. You'll see how it performs.
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect usage statistics (which queries are run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (the 'table' itself being just one of its indexes), not per table, so the question is not what to partition on, but which indexes to partition or not, and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates (there is not much sense in partitioning just a non-clustered index and not the clustered one), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I had to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year'), the most natural partitioning is the sliding window.
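A sliding window on Year might start from something like this (a sketch in SQL Server syntax; the function and scheme names and the boundary values are placeholders):
CREATE PARTITION FUNCTION pfCaseYear (int)
AS RANGE RIGHT FOR VALUES (2010, 2011, 2012, 2013, 2014);
CREATE PARTITION SCHEME psCaseYear
AS PARTITION pfCaseYear ALL TO ([PRIMARY]);
New boundary values are then added (and old ones merged out) over time with ALTER PARTITION FUNCTION using SPLIT RANGE and MERGE RANGE.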
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier values are generated, you might need to allocate ids manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single result.
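A sketch of what such a query could look like (the column list is shortened and the filter is a placeholder; UNION ALL is used here because each row lives in exactly one of the tables):
SELECT PrimaryKey, Year, Type, Status FROM Case00 WHERE Year = 2014
UNION ALL
SELECT PrimaryKey, Year, Type, Status FROM Case01 WHERE Year = 2014
-- ... and so on for Case02 through Case08 ...
UNION ALL
SELECT PrimaryKey, Year, Type, Status FROM Case09 WHERE Year = 2014;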
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
Another possible thing to look at (before partitioning) is your model.
Is your database normalized? Are there further steps which could improve performance through different choices of normalization, denormalization, or partial normalization? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.