This is my diseases table definition:
CREATE TABLE diseases (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (drugid, id)
);
Now I want to perform searches by the id column only (all values in this column are unique). The primary key above was created to make drug searches quick.
Now - what will be best solution to filter this table using id? Creating new table? Pass additional value (drugid) to SELECT? Is it option with only id?
Thanks for help :)
Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately, querying by just the id is not possible unless you create a secondary index on it, which wouldn't be very good since it could trigger a full cluster scan.
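For completeness, the secondary-index route (not recommended here) would look something like this sketch:

CREATE INDEX ON diseases (id);

-- This now works, but the read may fan out across the cluster:
SELECT * FROM diseases WHERE id = ?;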
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that will have the id as partition key; in this case you will need to maintain both tables.
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
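For the second option, a minimal sketch of such a lookup table (the name diseases_by_id is just an assumption) could be:

CREATE TABLE diseases_by_id (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (id)
);

-- Every write must now go to both tables:
INSERT INTO diseases (drugid, id, name) VALUES (?, ?, ?);
INSERT INTO diseases_by_id (id, drugid, name) VALUES (?, ?, ?);

-- And the id-only lookup becomes a single-partition read:
SELECT * FROM diseases_by_id WHERE id = ?;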
Should you use a secondary index?
When you specify the partition key, Cassandra reads the exact data from a single partition, on a single node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance implications when an index is built over a column with many distinct values. Here is some more reading on this matter - Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by @doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data's on-disk sort order, but only within a partition. The default order is ASC.
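For illustration, the sort order within a partition can be overridden at creation time, since clustering order is fixed when the table is created (a sketch based on the schema above):

CREATE TABLE diseases (
    id text,
    drugid text,
    name text,
    PRIMARY KEY (drugid, id)
) WITH CLUSTERING ORDER BY (id DESC);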
I would suggest some more reading - When not to use an index and Using a secondary index
I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is - there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.
What would I be missing by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?
I thought why not create another table that stores the mapping.
That's okay. From the Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of
records for a small number of results. See Problems using a
high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many
distinct values, a query between the fields incurs many seeks for very
few results. In the table with a billion songs, looking up songs by
writer (a value that is typically unique for each song) instead of by
their recording artist is likely to be very inefficient.
It would probably be more efficient to manually maintain the table as
a form of an index instead of using the built-in index. For columns
containing unique data, it is sometimes fine performance-wise to use
an index for convenience, as long as the query volume to the table
having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
It's normal for Cassandra data modelling to have denormalized data.
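As a rough sketch of that manually maintained mapping (the table and column names are assumptions, and the types are guesses based on your description):

CREATE TABLE orders_by_number (
    order_number text PRIMARY KEY,
    order_id uuid
);

-- getByOrderNumber then resolves in two reads:
SELECT order_id FROM orders_by_number WHERE order_number = ?;
SELECT * FROM orders_by_id WHERE order_id = ?;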
Currently, I am dealing with Cassandra.
While reading a blog post, I came across this statement:
When issuing a CQL query, you must include all partition key columns,
at a minimum.
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems to be possible without including all partition key columns. Here is the table:
CREATE TABLE usertable (
personid text,
name text,
"timestamp" timestamp,
active boolean,
PRIMARY KEY ((personid, name), timestamp)
) WITH
CLUSTERING ORDER BY ("timestamp" DESC)
AND comment=''
AND read_repair_chance=0
AND dclocal_read_repair_chance=0.1
AND gc_grace_seconds=864000
AND bloom_filter_fp_chance=0.01
AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold':'32',
'min_threshold':'4' }
AND compression={ 'chunk_length_in_kb':'64',
'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
AND caching={ 'keys':'ALL',
'rows_per_partition':'NONE' }
AND default_time_to_live=0
AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
AND min_index_interval=128
AND max_index_interval=2048
AND memtable_flush_period_in_ms=0
AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02'; however, according to the blog post, I would have to include timestamp as well.
Can someone explain this?
In Cassandra, the partition key spreads data around the cluster: Cassandra computes the hash of the partition key and uses it to determine the location of the data in the cluster.
One exception is that if you use ALLOW FILTERING or a secondary index, you are not required to include all partition key columns in the WHERE clause.
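To illustrate against the usertable schema above (a sketch; filtering on part of a partition key requires a reasonably recent Cassandra version, and the second query scans across the cluster, so use it with care):

-- Rejected: personid alone is only part of the partition key (personid, name)
SELECT * FROM usertable WHERE personid = 'ABC-02';

-- Accepted, but potentially expensive:
SELECT * FROM usertable WHERE personid = 'ABC-02' ALLOW FILTERING;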
For further information, take a look at the blog post:
The purpose of a partition key is to split the data into partitions
where an entire partition is stored on a single node in the cluster
(with each node storing many partitions). When data is read or written
from the cluster, a function called Partitioner is used to compute the
hash value of the partition key. This hash value is used to determine
the node/partition which contains that row. The clustering key is used
further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from
a relational database. However, they are significantly more
restricted. The attributes allowed in ‘where’ clause of Cassandra
query must include the full partition key and additional clauses may
only reference the clustering key columns or a secondary index of the
table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra
to maintain constant result-set retrieval time as the cluster is
scaled-out by allowing Cassandra to determine the partition, and thus
the node (and even data files on disk), that the query must be
directed to.
If a query does not specify the values for all the columns from the
primary key in the ‘where’ clause, Cassandra will not execute it and
give the following warning :
‘InvalidRequest: Error from server: code=2200 [Invalid query]
message=”Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW
FILTERING” ‘
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
According to your schema, your timestamp column is the clustering column (the sorting column), not part of the partition key. That's why it is not required.
(personid, name) are your partition key columns.
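For comparison, queries that include the full partition key are routed to a single partition (illustrative values):

SELECT * FROM usertable WHERE personid = 'ABC-02' AND name = 'John';
SELECT * FROM usertable WHERE personid = 'ABC-02' AND name = 'John' AND "timestamp" >= '2018-01-01';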
This question already has answers here:
What are the differences between a clustered and a non-clustered index?
I need to add proper indexes to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use indexes for non-int columns? Why/why not?
I've read a lot about clustered and non-clustered indexes, yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about, and how can I know that everything is in order before going to the test phase?
A clustered index alters the way the rows are stored. When you create a clustered index on a column (or a number of columns), SQL Server sorts the table's rows by that column (or columns). It is like a dictionary, where all words are sorted in alphabetical order throughout the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
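A minimal T-SQL sketch of the two kinds of index (table and column names are just examples):

-- Clustered: physically orders the table's rows by LastName
CREATE CLUSTERED INDEX IX_Employees_LastName
    ON dbo.Employees (LastName);

-- Non-clustered: a separate structure that points back to the rows
CREATE NONCLUSTERED INDEX IX_Employees_DepartmentId
    ON dbo.Employees (DepartmentId);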
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, a lookup table, etc.) should have a clustering key. There's really no point not to have a clustering key. Actually, contrary to common belief, having a clustering key speeds up all the common operations - even inserts and deletes (since the table organization is different, and usually better, than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing, has a great many excellent articles on the topic of why to have a clustering key, and what kind of columns to best use as your clustering key. Since you only get one per table, it's of the utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL Server performance. Usually that implies that columns that are used to find rows in a table are indexed.
Clustered indexes make SQL Server order the rows on disk according to the index order. This implies that if you access data in the order of a clustered index, the data will be present on disk in the correct order. However, if the column(s) of the clustered index are frequently changed, the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either: they cost resources to maintain. So start out with the obvious ones, and then profile to see which ones you are missing and would benefit from. You do not need them from the start; they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large. Also it is common to create indexes on groups of columns (e.g. country + city + street).
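For example, a composite index along those lines (illustrative names):

CREATE INDEX IX_Address_Location
    ON dbo.Address (Country, City, Street);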
Also, you will not notice performance issues until you have quite a bit of data in your tables. Another thing to think about is that SQL Server needs statistics to do its query optimizations the right way, so make sure you generate those.
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let's say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of EmployeeID and a pointer to the row in the Employee table where that value is actually stored. A clustered index, on the other hand, will actually store the row data for a particular EmployeeID - so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table, like EmployeeName, EmployeeAddress, etc., will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASCENDING
You might want to consider a clustered index on FOO.BAR.
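The corresponding definitions might look like this (a sketch; remember that only one clustered index is allowed per table):

CREATE NONCLUSTERED INDEX IX_FOO_BAR ON FOO (BAR);
-- or, if BAR is the main sort/range column:
CREATE CLUSTERED INDEX CIX_FOO_BAR ON FOO (BAR);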
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Studio can give you suggestions on where to optimize, and MSDN has some information that you might find useful.
Clustered index:
faster to read than a non-clustered index, as the data is physically stored in index order;
only one can be created per table.
Non-clustered index:
quicker for insert and update operations than a clustered index;
any number of non-clustered indexes can be created per table.
When performing a query where the attributes selected make up the components of an index, does that result in a faster query? I would imagine that the query planner/optimizer could see that the requested columns can be satisfied completely by an index scan.
Trivial Example
CREATE TABLE "liked" (
"id" BIGINT NOT NULL DEFAULT nextval('liked_id_seq'),
"userid" BIGINT NOT NULL,
"storyid" BIGINT NOT NULL,
"notes" TEXT,
PRIMARY KEY ("id")
);
CREATE INDEX "liked_user" ON "liked" (
"userid",
"storyid"
);
ALTER TABLE "liked" ADD FOREIGN KEY ("userid") REFERENCES "users" ("id") ON DELETE CASCADE;
ALTER TABLE "liked" ADD FOREIGN KEY ("storyid") REFERENCES "story" ("id") ON DELETE CASCADE;
SELECT storyid from liked where userid = 1;
With the query above, there isn't any data external to what is already contained in the liked_user index, so I would imagine there would be fewer operations if the query optimizer could infer that the resulting tuples can be satisfied by the index alone.
It's called a "covering index", and it improves the efficiency somewhat, by varying amounts depending on which DBMS you are using (and if you are using MySQL, which flavor of storage).
Try giving an example of a specific situation, if you have one.
In general, yes, but it depends on how you are accessing them. Using LIKE to match against the middle of a string field isn't going to be any faster with an index.
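For instance (illustrative, using the liked table above):

-- May use a b-tree index on notes (anchored prefix; in PostgreSQL this
-- also depends on the collation/operator class):
SELECT * FROM liked WHERE notes LIKE 'abc%';

-- Cannot use a plain b-tree index (leading wildcard):
SELECT * FROM liked WHERE notes LIKE '%abc%';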
Not so much, in my experience. You speed up queries by optimizing their conditions and trying to use the best possible index in those conditions. There are many ways to slow down a query based on what you are selecting, such as using subqueries or UDFs, and of course you can slow down queries with less-than-optimal joins.
It can do, but there are some caveats. These comments are based on Oracle, btw.
For example, SELECT COL1 FROM MY_TABLE might be able to use an index, but if all the columns of the index are nullable then there might be rows not included in a regular btree index, so the index might not be used.
It's also possible that an index might be larger (and therefore more costly to full-scan) than the underlying table (for example, where the table only has a single column), because the index has to include a rowid for every entry as well as the column values. In that case, unless the query can leverage the index information in some special way (for example, you are including an ORDER BY clause that the index can satisfy without the need for a sort), the index might not be used.
You ought to also look into the various index access methods that the RDBMS can use in order to understand their strengths and weaknesses. In Oracle these would generally be INDEX RANGE SCAN, FULL INDEX SCAN, FAST FULL INDEX SCAN, and INDEX SKIP SCAN. This knowledge will help you understand whether an index could be used and in what way.
I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index it is what is referred to as a "Heap" and performance of most types of common queries is less for a heap than for a clustered index table.
Which fields the clustered index should be established on depends on the table itself and the expected usage patterns of queries against the table. In almost every case you probably want the clustered index to be on a column, or a combination of columns, that is unique (i.e., an alternate key), because if it isn't, SQL Server will add a unique value to the end of whatever fields you select anyway. Good candidates are columns that queries will frequently use to select or filter multiple records: for example, a sales-transactions table where your application frequently requests transactions by product id; or, even better, an invoice-details table where in almost every case you retrieve all the detail records for a specific invoice; or an invoice table where you often retrieve all the invoices for a particular customer. This is true whether you select large numbers of records by a single value or by a range of values.
These columns are candidates for the clustered index. The order of the columns in the clustered index is critical: the first column defined in the index should be the column that will be selected or filtered on first in expected queries, as in the sketch below.
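An illustrative composite clustered key for such an invoice-details table (all names assumed):

CREATE CLUSTERED INDEX CIX_InvoiceDetail
    ON dbo.InvoiceDetail (InvoiceId, LineNumber);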
The reason for all this is based on understanding the internal structure of a database index. These indices are called balanced-tree (B-tree) indices. They are somewhat like a binary tree, except that each node in the tree can have an arbitrary number of entries (and child nodes), instead of just two. What makes a clustered index different is that the leaf nodes in a clustered index are the actual physical disk data pages of the table itself, whereas the leaf nodes of a non-clustered index just "point" to the table's data pages.
When a table has a clustered index, therefore, the table's data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly-linked list).
So if your query requests a range of rows in the same order as the clustered index, the processor only has to traverse the index once (or maybe twice) to find the start page of the data, and then follow the linked-list pointers to get to the next page and the next, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
EDIT:
To address the sequential issue for GUID key columns, be aware that SQL Server 2005 has NEWSEQUENTIALID(), which does in fact generate GUIDs the "old" sequential way.
Or you can investigate Jimmy Nilsson's COMB GUID algorithm, which is implemented in client-side code:
COMB Guids
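To illustrate the NEWSEQUENTIALID() approach, a hedged T-SQL sketch (table and constraint names are made up; NEWSEQUENTIALID() is only usable as a column default):

CREATE TABLE dbo.Orders (
    OrderId uniqueidentifier NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT NEWSEQUENTIALID(),
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);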
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the record into the middle of the table.
However, with integer-based clustered indexes, the integers are normally sequential (like with an IDENTITY spec), so they just get added to the end and no data needs to be moved around.
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
SELECT OBJECT_NAME (ips.[object_id]) AS 'Object Name',
si.name AS 'Index Name',
ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
ips.page_count AS 'Pages',
ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats
(DB_ID ('MyDatabase'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE si.object_id = ips.object_id
AND si.index_id = ips.index_id
AND ips.index_level = 0;
If you are using NewId(), you could switch to NewSequentialId(). That should help the insert perf.
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have an "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those books are physically next to each other on disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
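A sketch of that layout (illustrative names; the primary key would then have to be declared non-clustered, since a table allows only one clustered index):

-- Cluster Book on AuthorId so one author's rows are contiguous on disk
CREATE CLUSTERED INDEX CIX_Book_AuthorId ON dbo.Book (AuthorId);

SELECT Title FROM dbo.Book WHERE AuthorId = @authorId;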
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
As most have mentioned, avoid using a random identifier in a clustered index: you will not gain the benefits of clustering. Actually, you will experience increased delay. Getting rid of all of them is solid advice. Also keep in mind that newsequentialid() can be extremely problematic in a multi-master replication scenario: if databases A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes, you should remove the clustered index on GUID primary keys, for the reasons Galwegian states above. We have done this in our applications.
It depends on whether you're doing a lot of inserts, or whether you need very quick lookups by PK.