I have been reading articles about Apache Cassandra lately and I have a fair grasp of the partition key and the clustering key and their difference. But I wonder: what is the point of the clustering key? Does it help retrieve data faster?
The clustering key provides uniqueness of the rows inside a partition (by combining the values of all clustering columns) and keeps the data in sorted order. Also, when you're retrieving multiple related values, reading them from the same partition can be faster than reading across multiple partitions, since the operation is performed on the one or more replicas responsible for that single partition.
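To make that concrete, here is a minimal, hypothetical sketch (the table and column names are invented, not from the question): the clustering column keeps rows sorted inside each partition, so a slice query over one partition reads a contiguous, ordered range.

CREATE TABLE sensor_readings (
    sensor_id text,
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- all rows for one sensor live in one partition, sorted by reading_time,
-- so this range query reads a contiguous slice from the replicas owning that partition
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-42'
  AND reading_time >= '2020-01-01' AND reading_time < '2020-02-01';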
Let's say I have a Cassandra table with the following primary key
((partitionKey1, partitionKey2), clusteringKey1, clusteringKey2)
If I write a query like
SELECT * FROM my_table WHERE clusteringKey1 = clusteringKey1Value ALLOW FILTERING
it is said that the cluster has to read across all the nodes which is fine, since I haven't specified the partition key.
But the data is ordered by clustering key. So shouldn't it be able to use binary search or something to figure out right row for the given clusteringKey1Value? Why does it have to scan all the rows and perform filtering?
It is not efficient to read data in Cassandra without the partition key. The partition key is important because it allows Cassandra to identify the nodes where that particular partition lives; the partition key is like a zip code that lets Cassandra find the right node quickly.
Say you have a 100-node cluster and replicate data to three nodes, so your data resides on, for example, node98, node99 and node100. If you query without the partition key, Cassandra does not know where that data lives, so it has to scan all the nodes. Also, clustering-key ordering only applies to rows within a partition: once a partition has been located, Cassandra can apply binary search and other optimizations to find rows quickly inside it, but to reach the partition in the first place it needs the partition key.
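As a rough sketch against the table from the question (same names), the difference looks like this; the first query can be routed to the replicas owning the partition and then use the clustering order, the second cannot:

-- routed to one partition; clusteringKey1 is sorted inside it, so the lookup is cheap
SELECT * FROM my_table
WHERE partitionKey1 = ? AND partitionKey2 = ? AND clusteringKey1 = ?;

-- no partition key: every partition on every node must be read and then filtered
SELECT * FROM my_table
WHERE clusteringKey1 = ? ALLOW FILTERING;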
(Submitting on behalf of a Snowflake User)
QUESTION:
Why would the filter or search key (the key used in the WHERE clause) be a better choice for the cluster key than an ORDER BY or GROUP BY key?
One resource recommends reading: https://support.snowflake.net/s/article/case-study-how-clustering-can-improve-your-query-performance
Another resource mentions:
The performance of a query filter will be better because the data is sorted, so it can skip all the rows that are not required.
For the scenario where the query filters on columns that are not part of the sort order, but the GROUP BY and ORDER BY columns are part of the data's sort order (the clustering keys), it may take time to select the data, but the sorting will be easy since the data is already in order.
A 3rd resource states:
The clustering key is important for the WHERE clause when you only select a small portion of the overall data that you have in your tables, because it can reduce the amount of data that has to be read from the Storage into the Compute when the Optimizer can use the clustering key for Query Pruning.
You can alternatively use the clustering key to optimize table inserts and possibly also query output (eg sort order).
Your choice should depend on your priorities; there is no cure-all unless a single key covers all of the above.
To which the User responds with the following questions:
If I always insert the rows in the order in which they will be retrieved, do I still need to create a cluster key? For example if a table is always queried using a date_timestamp and if I ensure that I am inserting in the table order by date_timestamp, do I still need to create a cluster key on date_timestamp?
Any thoughts, recommendations, etc.? Thanks!
For choosing a cluster key based on FILTER/GROUP/SORT: the first "resource" is right.
If the filter will result in pruning, then clustering on the filter key is probably best (so that data can be skipped). If all or most of the data must be read, then clustering on a GROUP/SORT key is probably faster (so less time is spent re-sorting). These docs state:
Typically, queries benefit from clustering when the queries filter or sort on the clustering key for the table. Sorting is commonly done for ORDER BY operations, for GROUP BY operations, and for some joins.
For the second question on natural clustering, there would be little to no performance benefit for defining a cluster key in that case.
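As a hedged illustration (the sales table and its columns are invented for this example): in Snowflake a clustering key is declared with CLUSTER BY, and it pays off when queries filter on that key so the optimizer can prune micro-partitions.

-- declare (or change) the clustering key on an existing table
ALTER TABLE sales CLUSTER BY (sale_date);

-- a filter on the clustering key lets the optimizer skip micro-partitions
SELECT SUM(amount)
FROM sales
WHERE sale_date BETWEEN '2019-01-01' AND '2019-01-31';

If rows are always loaded in sale_date order, the micro-partitions are already well clustered on that column, which is why an explicit clustering key adds little in the natural-clustering case described above.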
This is my diseases table definition:
id text,
drugid text,
name text,
PRIMARY KEY (drugid, id)
Right now I search by the drugid column only (all values in this column are unique); this primary key was created for quick drug lookups.
Now, what would be the best solution to filter this table using id? Creating a new table? Passing the additional value (drugid) to the SELECT? Is there an option using only id?
Thanks for the help :)
Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately, querying with just the id is not possible unless you create a secondary index on it, which wouldn't be very good since it could trigger a full cluster scan.
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that has id as its partition key (sketched below); in this case you will need to maintain both tables;
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
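A minimal sketch of the second option (the table name diseases_by_id is just illustrative): a second table keyed by id that your application writes to alongside the original one, so that lookups by id hit exactly one partition.

CREATE TABLE diseases_by_id (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (id)
);

-- lookup by id now goes straight to one partition
SELECT * FROM diseases_by_id WHERE id = ?;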
Should you use a secondary index?
When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance implications when an index is built over a column with lots of distinct values. Here is some more reading on this matter - Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by @doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.
I would suggest some more reading - When not to use an index and Using a secondary index
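For completeness, here is a hedged sketch of what the secondary-index route would look like on the table from the question (again, not recommended here given the high cardinality of id):

-- a secondary index would let you query by id alone...
CREATE INDEX diseases_id_idx ON diseases (id);

-- ...but this query may have to contact many nodes, since matching ids can live in any partition
SELECT * FROM diseases WHERE id = ?;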
I have a column family that I expose to some applications via DataStax Enterprise Search's Solr HTTP API. In some use cases, I thought it might be preferable to access the CQL layer directly.
When taking a closer look at the underlying data model, though, I see that the unique key in Solr is mapped to the partition key in Cassandra, not making use of compound keys with clustering columns.
Won't this produce a single wide row per partition?
And isn't that a "poor" data model for large data sets?
The unique key in your Solr schema should be a comma-separated list of all of the partition and clustering columns, enclosed within parentheses. Composite partition keys are supported as well as compound primary keys.
See the doc:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/srch/srchConfSkema.html
Yes, you do get a single wide storage row for each partition key, but it's your choice whether a column in your Cassandra primary key should be used as a clustering column or in the partition key. If you feel that your storage rows in Cassandra are too wide, move one of the clustering columns into a composite partition key, or add another column for that purpose.
Balancing the number of partitions and partition width is of course critical, but DSE/Solr is not restricting your choice.
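To illustrate that last point with a hedged CQL sketch (column names are hypothetical): moving a column from the clustering columns into a composite partition key turns one very wide storage row into many narrower partitions.

-- one wide storage row per sensor: all days for a sensor share a partition
CREATE TABLE readings_wide (
    sensor_id text,
    day date,
    ts timestamp,
    value double,
    PRIMARY KEY ((sensor_id), day, ts)
);

-- day moved into the partition key: one narrower partition per sensor per day
CREATE TABLE readings_bucketed (
    sensor_id text,
    day date,
    ts timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), ts)
);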
Can HashTables be used to create indexes in databases? What is the ideal Data structure to create indexes?
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
Can HashTables be used to create indexes in databases?
Some DBMSes support hash-based indexes, some don't.
What is the ideal Data structure to create indexes?
No data structure occupies 0 bytes, nor can it be manipulated in 0 CPU cycles, therefore no data structure is "ideal". It is up to us, the software engineers, to decide which data structure has the most benefits and the fewest detriments for the specific goal we are trying to accomplish.
For example, B-Trees are useful for range scans and hash indexes aren't. Does that mean B-Trees are "better"? Well, they are if you need range scans, but not necessarily if you don't.
If a table has a foreign key referencing a field in another database, will it help if we create an index on the foreign key?
You cannot normally have a foreign key referencing another database, only another table.
And yes, it tends to help, since every time a row is updated or deleted in the parent table, the child table needs to be searched to see if the FK was violated. This search can significantly benefit from such an index. Many (but not all) DBMSes require index on FK (and might even create it automatically if not already there).
OTOH, if you only add rows to the parent table, you could consider leaving the child table unindexed on FK fields (assuming your DBMS allows you to do so).
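A generic SQL sketch of the point above (table and column names are invented; exact FK behaviour varies by DBMS): the child table's FK column is indexed so that deletes and updates on the parent do not force a scan of the child table.

-- parent table
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100)
);

-- child table referencing the parent; the FK column is not indexed automatically in many DBMSes
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (customer_id),
    amount      DECIMAL(10, 2)
);

-- index the FK column so parent-side deletes/updates can check the child table cheaply
CREATE INDEX idx_orders_customer_id ON orders (customer_id);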
Oracle Perspective
Oracle supports clustering by hash value, either for single or multiple tables. This physically colocates rows having the same hash value for the cluster columns, and is faster than accessing via an index. There are disadvantages due to increased complexity and a certain need for preplanning.
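A hedged Oracle sketch of a single-table hash cluster (object names are made up): rows with the same hash of the cluster key are stored in the same blocks, so a lookup by that key can go straight to the data without an index probe. The SIZE and HASHKEYS values are the preplanning mentioned above.

-- hash cluster keyed on customer_id; SIZE and HASHKEYS must be estimated up front
CREATE CLUSTER customer_cluster (customer_id NUMBER(10))
    SIZE 512 HASHKEYS 1000;

CREATE TABLE customers (
    customer_id NUMBER(10) PRIMARY KEY,
    name        VARCHAR2(100)
) CLUSTER customer_cluster (customer_id);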
You could also use a function-based index to index based on a hash function applied to one or more columns. I'm not sure what the advantage of that would be though.
Foreign key columns in Oracle generally benefit from indexing due to the obvious performance advantages.