Querying a High Cardinality Field

Querying a High Cardinality Field - database

I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. By getByOrderNumber is - there's a one-to-one mapping b/w the orderId and the orderNumber (high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.
What would I be missing on by managing this index myself? One thing I can see if for every write, I'll now have to update two tables. Anything else?

I thought why not create another table that stores the mapping.
That's okay. From Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of
records for a small number of results. See Problems using a
high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many
distinct values, a query between the fields incurs many seeks for very
few results. In the table with a billion songs, looking up songs by
writer (a value that is typically unique for each song) instead of by
their recording artist is likely to be very inefficient..
It would probably be more efficient to manually maintain the table as
a form of an index instead of using the built-in index. For columns
containing unique data, it is sometimes fine performance-wise to use
an index for convenience, as long as the query volume to the table
having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
It's normal for Cassandra data modelling to have a denormalized data.

Related

Covering index including rowversion? Good or bad

I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamp. Client will then request data with incorrect version number. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if a non-clustered index on my RegionId column, and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is rowversion/timestamp column, and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index since it will need to be updated at every insert/update/delete?
Since this will result in n index scans, rather than n index seeks, it might be better to read all the data once, and then group by regionId and fill in empty lists of rows where a regionId doesn't have any data.
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I haven not yet looked at including one to many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to better use them. Since I am going to read all the data from the table in any case, it is probably cheaper to load them all at once. However, reading them as from the query above, it makes my code a lot cleaner for this simple no-relationship example alone.
Edit:
Alternative 2
Another option that came to mind, is creating a covering index on RegionId, and include my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
For the current scenario, there are only max thousands of rows in the table, and not in the millions. Network traffic for the two (x n) queries might most likely outweigh the benefits of using indexes, and be premature optimization.

Cassandra - search by clustered key

This is my diseases table definition:
id text,
drugid text,
name
PRIMARY KEY (drugid, id)
Now I want to perform search by drugid column only (all values in this column are unique). This primary key was created due to quick drug search.
Now - what will be best solution to filter this table using id? Creating new table? Pass additional value (drugid) to SELECT? Is it option with only id?
Thans for help :)

Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately just having the id is not possible, unless you create a secondary index on it. Which wouldn't be very good since you could trigger a full cluster scan.
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that will have the id as partition key; in this case you will need to maintain both tables;
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
Should you use a secondary index?
When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance impact implications when an index is built over a column with lots of distinct values. Here is some more reading on this matter - Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by #doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.
I would suggest some more reading - When not to use an index and Using a secondary index

Clustered index consideration in regards to distinct valus and large result sets and a single vertical table for auditing

I've been researching best practices for creating clustered indexes and I'm just trying to totally understand these two suggestions that's listed with pretty much every BLOG or article on the matter
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contrary or I'm guessing maybe it just depends on how you're accessing the table.. Or my interpretation of what "large result sets" mean is wrong....
Unless you're doing range queries over the clustered column it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered indexes on the PK you're rarely going to fulfill the large result set suggestion but of course it does the large number of distinct values..
To give the question a little more context. This quetion stems from a vertical auditing table we have that has a column for TABLE.... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non distinct... Each result set of tablenames is rather large which seems to fulfill that second conditon but it's definitely not largerly unique.... Which means all that other stuff happens with having to add the 4 byte Uniquifer (sp?) which makes the table a lot larger etc...
This situation has come up a few times for me when I've come upon DBs that have say all the contact or some accounts normalized into a single table and they are only separated by a TYPE parameter. Which is on every query....
In the case of the audit table the queries are typically not that exciting either they are just sorted by date modified, sometimes filtered by column, user that made the change etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP so that inserting is fast so there's not contention between tables being audited and then to generate indexed views over the data ...

Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable, and is just wasted space given that the order of your rows within the non-unique groupings is not apparently obvious so trying to narrow down to a single row is still a range.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also being ordering on the Date so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly but the rows needed to satisfy the query are likely spread out across the datafile(s) which would require more I/O to get. AND, more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to inculde the Date field in all queries as any queries using only TableName (and possibly other filters but NOT using the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName being first.
I would be weary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.

Is there any benefit to an index when the column is only used for (hash) joining?

Lets say that a column will only be used for joining. (i.e. I won't be ordering on the column, nor will a search for specific values in the column individually) ... the only thing that I will use the column for is joining to another table.
If the database supports Hash Joins (which from my understanding don't benefit from indexes) .. then wouldn't the addition of an index be completely redundant? (and wasteful) ?

In SQL Server it will still prevent a Key Lookup.
If you JOIN on an unindexed field, the server needs to get the values for that field from the clustered index.
If you JOIN on a NC index, the values can be obtained directly without loading all the data pages from the cluster (which really is the whole table).
So essentially you save yourself a lot of IO as the first step filters down based on a very narrow index instead of on the entire table loaded from disk.

Database optimization: Hashing all the values

Typically, the databases are designed as below to allow multiple types for an entity.
Entity Name
Type
Additional info
Entity name can be something like account number and type could be like savings,current etc in a bank database for example.
Mostly, type will be some kind of string. There could be additional information associated with an entity type.
Normally queries will be posed like this.
Find account numbers of this particular type?
Find account numbers of type X, having balance greater than 1 million?
To answer these queries, query analyzer will scan the index if the index is associated with a particular column. Otherwise, it will do a full scan of all the rows.
I am thinking about the below optimization.
Why not we store the hash or integral value of each column data in the actual table such that the ordering property is maintained, so that it will be easy for comparison.
It has below advantages.
1. Table size will be lot less because we will be storing small size values for each column data.
2. We can construct a clustered B+ tree index on the hash values for each column to retrieve the corresponding rows matching or greater or smaller than some value.
3. The corresponding values can be easily retrieved by having B+ tree index in the main memory and retrieving the corresponding values.
4. Infrequent values will never need to retrieved.
I am still having more optimizations in my mind. I will post those based on the feedback to this question.
I am not sure if this is already implemented in database, this is just a thought.
Thank you for reading this.
-- Bala
Update:
I am not trying to emulate what the database does. Normally indexes are created by the database administrator. I am trying to propose a physical schema by having indexes on all the fields in the database, so that database table size is reduced and its easy to answer few queries.
Updates:(Joe's answer)
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
In a typical table, all the physical data will be stored. But now by generating a hash value on each column data, I am only storing the hash value in the actual table. I agree that its not reducing the size of the database, but its reducing the size of the table. It will be useful when you don't need to return all the column values.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
There can be only one clustered index on a table and all other indexes have to unclustered indexes. With my approach I will be having clustered index on all the values of the database. It will improve query performance.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
The basic idea is that instead of creating a separate table for each column for efficient access, we are doing it at the physical level.
Now the table will look like this.
Row1 - OrderedHash(Column1),OrderedHash(Column2),OrderedHash(Column3)

Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.
This is mainly useful when you need to index a column which contains long values, e.g. varchars which are on average more than 100 characters or something like that.

How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?

I don't think your approach is very helpful.
Hash values only help for equality/inequality comparisons, but not less than/greater than comparisons, compared to pretty much every database index.
Even with (in)equality hash functions do not offer 100% guarantee of having given you the right answer, as hash collisions can happen, so you will still have to fetch and compare the original value - boom, you just lost what you wanted to save.
You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out-of-order.
If you don't want the database to have to work around colums you do not use in a query, then use indexes with extra data columns - if your query is ordered according to that index, and your query only uses columns that are in the index (coulmns the index is based on plus columns you have explicitly added into the index), the DBMS will not read the original table.
Etc.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight