I have a column family that I expose to some application via DataStax Enterprise Search's SolR HTTP API. In some use cases, I thought it might be preferable directly accessing the cql layer.
When taking a closer look at the underlying data model though, I see that the unique in SolR is mapped to the partition key in Cassandra, not making use of compound keys with clustering columns.
Won't this produce a single wide row per partition?
And isn't that a "poor" data model for large data sets?
The unique key in your Solr schema should be a comma-separated list of all of the partition and clustering columns, enclosed within parentheses. Composite partition keys are supported as well as compound primary keys.
See the doc:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/srch/srchConfSkema.html
Yes, you do get a single wide storage row for each partition key, but it's your choice whether a column in your Cassandra primary key should be used as a clustering column or in the partition key. If you feel that your storage rows in Cassandra are two wide, move one of the clustering columns into a composite partition key, or add another column for that purpose.
Balancing the number of partitions and partition width is of course critical, but DSE/Solr is not restricting your choice.
Related
I have been reading articles about Apache Cassandra lately and I pretty perceive partitioning key and clustering key and their difference. But I wonder what is the point of clustering key? Does it help to retrieve data faster?
Clustering key provides uniqueness of the rows inside partition (by combining values of all clustering columns), and organize data in sorted order. Plus when you're retrieving multiple related values, then reading them from the same partition could be faster than retrieving multiple partition keys as you're performing that operation inside one or more replicas that are responsible for given partition.
Currently, I am dealing with Cassandra.
While reading a blog post, it is said:
When issuing a CQL query, you must include all partition key columns,
at a minimum.
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems like it possible without including all partition keys. Here the table:
CREATE TABLE usertable (
personid text,
name text,
"timestamp" timestamp,
active boolean,
PRIMARY KEY ((personid, name), timestamp)
) WITH
CLUSTERING ORDER BY ("timestamp" DESC)
AND comment=''
AND read_repair_chance=0
AND dclocal_read_repair_chance=0.1
AND gc_grace_seconds=864000
AND bloom_filter_fp_chance=0.01
AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold':'32',
'min_threshold':'4' }
AND compression={ 'chunk_length_in_kb':'64',
'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
AND caching={ 'keys':'ALL',
'rows_per_partition':'NONE' }
AND default_time_to_live=0
AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
AND min_index_interval=128
AND max_index_interval=2048
AND memtable_flush_period_in_ms=0
AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02';. However, according to the blog post, I have to include timestamp as well.
Can someone explain this?
In cassandra, partition key spreads data around cluster. It computes the hash of partition key and determine the location of data in the cluster.
One exception is, if you use ALLOW FILTERING or secondary index it does not require you too include all partition keys in where query.
For further information take a look at blog post:
The purpose of a partition key is to split the data into partitions
where an entire partition is stored on a single node in the cluster
(with each node storing many partitions). When data is read or written
from the cluster, a function called Partitioner is used to compute the
hash value of the partition key. This hash value is used to determine
the node/partition which contains that row. The clustering key is used
further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from
a relational database. However, they are significantly more
restricted. The attributes allowed in ‘where’ clause of Cassandra
query must include the full partition key and additional clauses may
only reference the clustering key columns or a secondary index of the
table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra
to maintain constant result-set retrieval time as the cluster is
scaled-out by allowing Cassandra to determine the partition, and thus
the node (and even data files on disk), that the query must be
directed to.
If a query does not specify the values for all the columns from the
primary key in the ‘where’ clause, Cassandra will not execute it and
give the following warning :
‘InvalidRequest: Error from server: code=2200 [Invalid query]
message=”Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW
FILTERING” ‘
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
According to your schema, your timestamp column is the clustering column, the sorting column, no part of the partition key. That’s why it is not required.
(personid, name) are your partitions columns.
This is my diseases table definition:
id text,
drugid text,
name
PRIMARY KEY (drugid, id)
Now I want to perform search by drugid column only (all values in this column are unique). This primary key was created due to quick drug search.
Now - what will be best solution to filter this table using id? Creating new table? Pass additional value (drugid) to SELECT? Is it option with only id?
Thans for help :)
Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately just having the id is not possible, unless you create a secondary index on it. Which wouldn't be very good since you could trigger a full cluster scan.
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that will have the id as partition key; in this case you will need to maintain both tables;
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
Should you use a secondary index?
When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance impact implications when an index is built over a column with lots of distinct values. Here is some more reading on this matter - Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by #doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.
I would suggest some more reading - When not to use an index and Using a secondary index
Given a distributed system which is persisting records with a primary key being 'url'. Given that multiple servers are collecting data, the 'url' is a handy/convenient and accurate means of guaranteeing uniqueness. Our system queries documents by as frequently as 10,000 times per minute at the moment.
We would like to add another unique key, being a 'uuid' so that we can refer to resources as:
http://example.com/fju98hfhsiu
Rather than, for example:
http://example.com/?u=http%3A%2F%2Fthis.is.a.long.url.com%2Fthis_is%2Fa%2Fpagewitha%2Flong-url.html
It seems that creation of secondary index of UUID's is not ideal in cassandra. Is there any way to avoid creating a secondary index of UUID's in cassandra?
Let's start with the fact, that best practice and the main pattern of Cassandra is to create tables for queries, and not queries for tables, if you need to create index on table, it is "auto" anti pattern. Based on this, the simplest solution is just to use 2 tables with 2 keys.
In your case, the "uuid", is not UUID, it is some concatenation of domain and hash, of the rest of the URL i believe .If your application can generate this key on the time of request, you can just use it as the partition key, and the full URL as clustering key.
Also, if there is no hot domains,(for example http://example.com) you can use the domain as the partition key, and hash and long urls as clustering keys, creating materialized views to support different queries.
In the end, just add secondary index and see performance impact in your specific case. If it works for you, and you don't want do deal with 2 tables, materialized views etc, just use it.
I have a table for caching geocoding results where I had planned to use the search string as the primary key/clustered index, since I wanted this to be both unique and indexed for quick lookup. This would be pretty big, probably nvarchar(300). I've seen that large cluster keys are generally advised against due to size and performance, and I suppose that uniqueness isn't critical since i'm just caching results, so is there a better way to achieve this?
The search string can contain optional bounding coordinates that could be put into a separate column, but they do form part of the uniqueness identity.
Also i'd want to keep this table compatible with Sql Azure, which requires a clustered index somewhere in the table.