Use case
Primary index - random string (unique entry)
Secondary index - random string (there can be hundreds of thousands of rows with same value)
I want to update using primary index but query using secondary index.
Sample
Item and cost
The primary index is item and the secondary index is cost. Millions of items will have the same cost, and I need to find which items have cost X.
We selected MongoDB after running an extensive proof of concept on various databases.
AWS DocumentDB is MongoDB-compatible and is available out of the box as an AWS managed solution.
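The access pattern described above (write by a unique key, read by a non-unique one) can be sketched in plain Python. This is only an in-memory illustration, not MongoDB or DocumentDB code; the class and method names (`ItemStore`, `upsert`, `items_with_cost`) are made up for the example.

```python
from collections import defaultdict


class ItemStore:
    """Toy store: unique primary key (item) plus a non-unique secondary index (cost)."""

    def __init__(self):
        self.cost_by_item = {}                 # primary index: item -> cost
        self.items_by_cost = defaultdict(set)  # secondary index: cost -> items

    def upsert(self, item, cost):
        """Write through the primary key and keep the secondary index in sync."""
        old_cost = self.cost_by_item.get(item)
        if old_cost is not None:
            self.items_by_cost[old_cost].discard(item)
            if not self.items_by_cost[old_cost]:
                del self.items_by_cost[old_cost]
        self.cost_by_item[item] = cost
        self.items_by_cost[cost].add(item)

    def items_with_cost(self, cost):
        """Read through the secondary index without scanning every item."""
        return set(self.items_by_cost.get(cost, ()))


store = ItemStore()
store.upsert("apple", 5)
store.upsert("pear", 5)
store.upsert("plum", 7)
store.upsert("pear", 7)  # update by primary key; the index entry moves with it
print(sorted(store.items_with_cost(5)))  # ['apple']
print(sorted(store.items_with_cost(7)))  # ['pear', 'plum']
```

A database index does the same bookkeeping on every write, which is why a secondary index speeds up reads at the cost of extra write work.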
(Submitting on behalf of a Snowflake User)
QUESTION:
Why would the filter or search key (the key used in the WHERE clause) be a better choice for a cluster key than an ORDER BY or GROUP BY key?
One resource recommends reading: https://support.snowflake.net/s/article/case-study-how-clustering-can-improve-your-query-performance
Another resource mentions:
The performance of a query filter will be better because the data is sorted, so it can skip all the rows which are not required.
For a scenario where the query filters on columns which are not part of the sort order, but the GROUP BY and ORDER BY columns are part of the data sort order (cluster keys), it may take time to select the data, but the sorting will be easy since the data is already in order.
A third resource states:
The clustering key is important for the WHERE clause when you only select a small portion of the overall data that you have in your tables, because it can reduce the amount of data that has to be read from the Storage into the Compute when the Optimizer can use the clustering key for Query Pruning.
You can alternatively use the clustering key to optimize table inserts and possibly also query output (e.g. sort order).
Your choice should depend on your priorities; there is no cure-all unless a single key covers all of the above.
To which the User responds with the following questions:
If I always insert the rows in the order in which they will be retrieved, do I still need to create a cluster key? For example if a table is always queried using a date_timestamp and if I ensure that I am inserting in the table order by date_timestamp, do I still need to create a cluster key on date_timestamp?
Any thoughts, recommendations, etc.? Thanks!
For choosing a cluster key based on FILTER/GROUP/SORT, the first "resource" is right.
If the filter will result in pruning, then it is probably best (so that data can be skipped). If all or most of the data must be read, then clustering on a GROUP/SORT key is probably faster (so less time is spent re-sorting). These docs state:
Typically, queries benefit from clustering when the queries filter or
sort on the clustering key for the table. Sorting is commonly done for
ORDER BY operations, for GROUP BY operations, and for some joins.
For the second question on natural clustering, there would be little to no performance benefit for defining a cluster key in that case.
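To make the pruning argument concrete, here is a small Python sketch. It is a simplification and not Snowflake's implementation: rows are grouped into fixed-size chunks in arrival order, each chunk keeps the min and max of the filter column, and a query skips any chunk whose range cannot contain the target. The function names and the chunk size are assumptions chosen for the example.

```python
def make_partitions(values, size):
    """Split values, in arrival order, into fixed-size chunks with min/max metadata."""
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    return [{"min": min(c), "max": max(c), "rows": c} for c in chunks]


def scan_equal(partitions, target):
    """Return (matching rows, chunks read), skipping chunks whose range excludes target."""
    hits, read = [], 0
    for p in partitions:
        if p["min"] <= target <= p["max"]:
            read += 1
            hits.extend(r for r in p["rows"] if r == target)
    return hits, read


# 100 rows, 5 per day for days 1..20, loaded in two different orders.
in_date_order = [day for day in range(1, 21) for _ in range(5)]
round_robin = [day for _ in range(5) for day in range(1, 21)]

for label, rows in (("in date order", in_date_order), ("round robin", round_robin)):
    hits, read = scan_equal(make_partitions(rows, size=10), target=7)
    print(label, len(hits), read)
# in date order 5 1
# round robin 5 5
```

Both loads return the same five rows, but the load that arrived in date order reads one chunk instead of five. That is the effect of natural clustering: when rows already arrive in the order of the filter column, an explicit cluster key on that column has little left to improve.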
I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is: there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.
What would I be missing out on by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?
I thought why not create another table that stores the mapping.
That's okay. From Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of
records for a small number of results. See Problems using a
high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many
distinct values, a query between the fields incurs many seeks for very
few results. In the table with a billion songs, looking up songs by
writer (a value that is typically unique for each song) instead of by
their recording artist is likely to be very inefficient.
It would probably be more efficient to manually maintain the table as
a form of an index instead of using the built-in index. For columns
containing unique data, it is sometimes fine performance-wise to use
an index for convenience, as long as the query volume to the table
having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
Denormalized data is normal in Cassandra data modelling.
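A rough sketch of the self-maintained lookup table from the question, using two Python dicts to stand in for the two Cassandra tables. Nothing here is driver code, and the function names are invented for the illustration.

```python
import uuid

orders_by_id = {}        # stands in for the main table (partition key: orderId)
order_id_by_number = {}  # stands in for the lookup table (partition key: orderNumber)


def save_order(order_number, payload):
    """Write both tables; the application is responsible for keeping them in step."""
    order_id = uuid.uuid4()
    orders_by_id[order_id] = {"orderNumber": order_number, **payload}
    order_id_by_number[order_number] = order_id
    return order_id


def get_by_order_id(order_id):
    """One single-partition read."""
    return orders_by_id.get(order_id)


def get_by_order_number(order_number):
    """Two single-partition reads: number -> id, then id -> order."""
    order_id = order_id_by_number.get(order_number)
    return None if order_id is None else orders_by_id.get(order_id)


oid = save_order("ORD-1001", {"total": 42})
print(get_by_order_number("ORD-1001") == get_by_order_id(oid))  # True
print(get_by_order_number("ORD-9999"))  # None
```

In a real cluster the two writes in `save_order` are separate operations, so the application has to decide what happens if one succeeds and the other fails. Cassandra's logged batches are one option worth evaluating for that.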
This is my diseases table definition:
CREATE TABLE diseases (
    id text,
    drugid text,
    name text,  -- column type missing in the original post; text is assumed
    PRIMARY KEY (drugid, id)
);
Currently I search by the drugid column only (all values in this column are unique). This primary key was created to make drug searches quick.
Now, what would be the best solution to filter this table using id? Creating a new table? Passing an additional value (drugid) to SELECT? Is there an option with only id?
Thanks for the help :)
Looking at your table definition, the partition key is drugid. This means that your queries will have to include the drugid. But since id is also part of the primary key, you could do something like:
select * from diseases where drugid = ? and id = ?
Unfortunately, querying with just the id is not possible unless you create a secondary index on it, which wouldn't be very good since you could trigger a full cluster scan.
So, the solutions are:
specify the partition key (if possible), in this case drugid
create a new table that will have the id as partition key; in this case you will need to maintain both tables;
I guess the solution you'll choose depends on your data set. You should test to see how each solution behaves.
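The difference between the two access paths can be illustrated with a toy Python model, where a dict of dicts stands in for partitions (keyed by drugid) holding rows keyed by the clustering column id. This is only a mental model of the layout, not how Cassandra is implemented, and the function names are hypothetical.

```python
from collections import defaultdict

# partition key (drugid) -> clustering key (id) -> row
partitions = defaultdict(dict)


def insert(drugid, id_, name):
    partitions[drugid][id_] = {"drugid": drugid, "id": id_, "name": name}


def select_by_drugid_and_id(drugid, id_):
    """WHERE drugid = ? AND id = ?: goes straight to one partition."""
    return partitions.get(drugid, {}).get(id_)


def select_by_id_only(id_):
    """WHERE id = ? alone: every partition has to be checked."""
    checked, rows = 0, []
    for rows_by_id in partitions.values():
        checked += 1
        if id_ in rows_by_id:
            rows.append(rows_by_id[id_])
    return rows, checked


insert("drug-a", "d1", "flu")
insert("drug-b", "d2", "cold")
insert("drug-c", "d3", "asthma")

print(select_by_drugid_and_id("drug-b", "d2")["name"])  # cold
rows, checked = select_by_id_only("d2")
print(len(rows), checked)  # 1 3
```

With the partition key the lookup touches one partition; with only id, the work grows with the number of partitions, which is the cost a second table keyed by id avoids.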
Should you use a secondary index?
When specifying the partition key, Cassandra will read the exact data from the partition and from only one node.
When you create a secondary index, Cassandra needs to read the data from partitions spread across the whole cluster. There are performance implications when an index is built over a column with lots of distinct values. Here is some more reading on this matter: Cassandra at Scale: The Problem with Secondary Indexes
In the above article, there is an interesting comment by @doanduyhai:
"There is only 1 case where secondary index can perform very well and
NOT suffer from scalability issue: when used in conjunction with
PARTITION KEY. If you ensure that all of your queries using secondary
index will be of the form :
SELECT ... FROM ... WHERE partitionKey=xxx AND my_secondary_index=yyy
then you're safe to go. Better, in this
case you can mix in many secondary indices. Performance-wise, since
all the index reading will be local to a node, it should be fine"
I would stay away from secondary indexes.
From what you described, id will have distinct values, more or less, so you might run into performance issues since "a general rule of thumb is to index a column with low cardinality of few values".
Also, if id is a clustering column, the data will be stored in an ordered manner. The clustering column(s) determine the data’s on-disk sort order only within a partition key. The default order is ASC.
I would suggest some more reading - When not to use an index and Using a secondary index
I am new to full-text search and can't find an answer to this question, so I would like to know your thoughts on the matter.
Taken from:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/b57a0538-033a-41f1-bdfd-8084680043f2/full-text-catalog-best-practices?forum=sqldatabaseengine
Recommendations from SQL Server 2008 Books Online:
We recommend associating tables with the same update characteristics (such as small number of changes versus large number of changes, or tables that change frequently during a particular time of day) together under the same full-text catalog.
By setting up full-text catalog population schedules, full-text indexes stay synchronous with the tables without adversely affecting the resource usage of the database server during periods of high database activity.
When you assign a table to a full-text catalog, consider the following guidelines:
Always select the smallest unique index available for your full-text unique key. (A 4-byte, integer-based index is optimal.) This reduces the resources required by Microsoft Search service in the file system significantly. If the primary key is large (over 100 bytes), consider choosing another unique index in the table (or creating another unique index) as the full-text unique key. Otherwise, if the full-text unique key size exceeds the maximum size allowed (900 bytes), full-text population will not be able to proceed.
If you are indexing a table that has millions of rows, assign the table to its own full-text catalog.
Consider the amount of changes occurring in the tables being full-text indexed, as well as the total number of rows. If the total number of rows being changed, together with the numbers of rows in the table present during the last full-text population, represents millions of rows, assign the table to its own full-text catalog.
I'm just starting to build a Social Site into DynamoDB.
I will have a fair amount of data that relates to a user and I'm planning on putting this all into one table - eg:
userid
date of birth
hair
photos urls
specifics
etc - there could potentially be a few hundred attributes.
Question:
is there anything wrong with putting this amount of data into one table?
how can I query that data (could I do a query like this "All members between this age, this color hair, this location, and logged on this time) - assuming all this data is contained in the table?
if the contents of a table are long and I'm running queries on that table like above would the read IO's cost be high - might be a lot of entries in the table in the long run...
Thanks
No. You can't query DynamoDB this way. You can only query by the primary key (the hash key, optionally with a single range key). Scanning tables in DynamoDB is slow and costly and will cause your other queries to hang.
If you have a small number of attributes, you can easily create index tables for these attributes. But if you have more than a few, it becomes too complex.
Main Table:
Primary Key (Type: Hash) - userid
Attributes - the rest of the attributes
Index Table for "hair":
Primary Key (Type: Hash and Range) - hair and userid
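As a rough illustration of the index-table idea, the sketch below keeps one in-memory "index table" per indexed attribute and answers a multi-attribute filter by intersecting the userid sets. It uses plain Python dicts rather than DynamoDB calls. The second indexed attribute (`location`) and all function names are assumptions added for the example.

```python
from collections import defaultdict

INDEXED = ("hair", "location")  # hypothetical choice of indexed attributes

main_table = {}                                        # userid -> attributes
index_tables = {a: defaultdict(set) for a in INDEXED}  # attr -> value -> userids


def put_user(userid, **attrs):
    """Write the main item, then update one entry per index table."""
    old = main_table.get(userid, {})
    for a in INDEXED:
        if a in old:
            index_tables[a][old[a]].discard(userid)
        if a in attrs:
            index_tables[a][attrs[a]].add(userid)
    main_table[userid] = attrs


def find_users(**wanted):
    """Intersect userid sets from each index table (indexed attributes only)."""
    id_sets = [index_tables[a].get(v, set()) for a, v in wanted.items()]
    ids = set.intersection(*id_sets) if id_sets else set(main_table)
    return {u: main_table[u] for u in sorted(ids)}


put_user("u1", hair="brown", location="Paris")
put_user("u2", hair="brown", location="Oslo")
put_user("u3", hair="black", location="Paris")
print(list(find_users(hair="brown", location="Paris")))  # ['u1']
```

Every extra indexed attribute adds another write per update and another table to keep consistent, which is why this approach stops being practical beyond a few attributes.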
You can check out Amazon SimpleDB, which adds an index on the other attributes as well and therefore allows the kind of queries you want. But it is limited in its scale and its ability to support low latency.
You might also consider a combination of several data stores and tables as your requirements are different between your real time and reporting:
DynamoDB for the quick real time user lookup
SimpleDB/RDBMS (such as MySQL or Amazon RDS) for additional attribute filters and queries
In-memory DB (such as Redis) or Cassandra for counters and tables such as leaderboards or cohorts
Activity logs that you can analyze to discover patterns and trends