I am designing a DynamoDB table with the following attributes:
uniqueID | TimeStamp | Type | Content | flag
I need to get a list of all rows with flag set to true, sorted by TimeStamp.
uniqueID is a system-generated ID.
TimeStamp is the system time at the moment the row is written.
The number of distinct Type values will be less than 10.
flag: true/false
I can think of the following three approaches:
Make uniqueID the partition key of the table, and create a Global Secondary Index with flag as the partition key and TimeStamp as the sort key. I can then query the GSI with flag as the hash value and get items sorted on TimeStamp.
The problem here is that flag can only be true or false, and the number of rows with flag set to false is very small compared to true, so there will be only two partitions. This loses all the scaling characteristics of DynamoDB.
Another alternative is making Type the partition key and TimeStamp the sort key of the GSI. This is better, but when querying I can't select all values of Type at once, since DynamoDB requires the hash key in a Query. So I would need to query this GSI once per Type value (roughly as sketched after this list) and merge the results.
Scan the table (Scan operation): a Scan returns all rows with flag set to true without requiring a hash key, but it won't give me results sorted on TimeStamp.
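Roughly, approach 2 ends up as one query per Type value, merged on the client. This is only a sketch with boto3; the table name, GSI name, and Type values are placeholders, and it assumes flag is projected into that GSI:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("MyTable")   # placeholder table name
all_types = ["TYPE_A", "TYPE_B", "TYPE_C"]            # fewer than 10 distinct values

items = []
for t in all_types:
    resp = table.query(
        IndexName="Type-TimeStamp-index",             # hypothetical GSI name
        KeyConditionExpression=Key("Type").eq(t),
        FilterExpression=Attr("flag").eq(True),       # flag is not a key here, so it is filtered
        ScanIndexForward=True,                        # ascending by TimeStamp
    )
    items.extend(resp["Items"])

# Results from the different Type partitions still have to be merged client-side.
items.sort(key=lambda i: i["TimeStamp"])
```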
After analyzing the use case, I think approach 1 is the best for now.
Could you please suggest any approach better than this?
Thanks in advance!
Any partition key based on flag or Type will be bad, as there are only a few possible values (2 and fewer than 10, respectively), and the way your data spreads across partitions will be skewed. You need something that provides a good distribution, and in your case the natural candidate for the table's partition key is uniqueID.
The problem is that when you want to get results based on flag, especially when flag is true, you will get a lot of records, possibly a big majority of the table. So the scaling of DynamoDB won't help you much anyway if you need to read back most of the records.
You can try to create a GSI with flag as the partition key and TimeStamp as the range key. This is not an ideal key set, but it covers what you need. Having a good key on the base table means you can later easily switch to another solution (e.g. scanning instead of using the GSI). Keep in mind that if you want to avoid fetching from the base table when using the GSI, you will have to project the attributes you want returned into the GSI.
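A minimal sketch of adding such a GSI with boto3, assuming flag is stored as a string (DynamoDB key attributes cannot be booleans); the table/index names and throughput settings below are made up:

```python
import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="MyTable",                                   # placeholder
    AttributeDefinitions=[
        {"AttributeName": "flag", "AttributeType": "S"},   # e.g. "true" / "false"
        {"AttributeName": "TimeStamp", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "flag-TimeStamp-index",           # hypothetical name
            "KeySchema": [
                {"AttributeName": "flag", "KeyType": "HASH"},
                {"AttributeName": "TimeStamp", "KeyType": "RANGE"},
            ],
            # Project whatever you want returned so queries never touch the base table.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["Type", "Content"],
            },
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    }],
)
```

A Query against this index with flag = "true" then returns items already sorted by TimeStamp.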
So, summing up, I think you can choose between the GSI and scanning:
Scanning can be slower (test it) but won't require additional data storage.
The GSI can be faster (test it) but will require additional data storage.
I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done using either of the two.
My partition key is the orderId, so getByOrderId is not a problem. But getByOrderNumber is: there's a one-to-one mapping between the orderId and the orderNumber (a high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering is whether I could create a new table with the orderNumber as the partition key and the orderId as the only other column (a kind of secondary index, but maintained by me). A getByOrderNumber query could then be resolved in two calls.
Bear with me if the above solution is egregiously wrong; I am extremely new to Cassandra. As I understand it, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought, why not create another table that stores the mapping?
What would I be missing by managing this index myself? One thing I can see is that for every write, I'll now have to update two tables. Anything else?
I thought why not create another table that stores the mapping.
That's okay. From the Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns, because you then query a huge volume of records for a small number of results. See "Problems using a high-cardinality column index" below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many distinct values, a query between the fields incurs many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
It's normal in Cassandra data modelling to have denormalized data.
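As a rough sketch with the DataStax Python driver (the keyspace, contact point, table, and column names here are assumptions, not taken from your model), the hand-maintained mapping table and the two-call read could look like this:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")   # placeholder contact point / keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_number (
        order_number text PRIMARY KEY,
        order_id     uuid
    )
""")

def get_by_order_number(order_number):
    # Call 1: resolve the user-friendly number to the orderId.
    row = session.execute(
        "SELECT order_id FROM orders_by_number WHERE order_number = %s",
        (order_number,),
    ).one()
    if row is None:
        return None
    # Call 2: fetch the order itself from the main table keyed on orderId.
    return session.execute(
        "SELECT * FROM orders WHERE order_id = %s",
        (row.order_id,),
    ).one()
```

On the write path, inserting into both tables inside a logged batch helps keep the mapping from drifting out of sync.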
I have a table in DynamoDB with id as the primary key and a global secondary index (GSI). The GSI has p_identifier as its hash key and financialID as its range key. financialID is a 6-digit number starting at 100000. I have a requirement to get the maximum financialID so that the next record to be added can have a financialID incremented by 1.
Can anyone help me make this work? Also, is there any alternative way to do this?
I would use a different approach.
From your requirements I am assuming financialID should be unique.
The database won't prevent you from duplicating it, so some other part of your application has to keep these numbers in sync. What you need is an atomic counter.
If you must use DynamoDB alone, you should set up a table just for this type of task:
a table with a hash primary key, holding a counter item called financial_id_counter that you atomically raise by 1, using the value retrieved as the next financialID.
This is not ideal, but it can work by issuing updates with the UpdateItem ADD action.
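A minimal sketch of that counter with boto3 (the counters table, its key, and the attribute names are assumptions; the counter item would be seeded so that the first increment yields 100000):

```python
import boto3

counters = boto3.resource("dynamodb").Table("counters")   # placeholder table

def next_financial_id():
    # Atomically add 1 and read back the new value in a single request.
    resp = counters.update_item(
        Key={"counter_name": "financial_id_counter"},
        UpdateExpression="ADD current_value :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["current_value"])
```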
If you need the financialID to be in strict order, the approach by #Chen is good.
On the other hand, if you just need a unique id here, you can use a UUID. There is still a very small chance of collision; to counter this, use the API with the "Exists" condition, so the call fails if the id already exists and you can retry with another UUID.
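A sketch of that pattern with boto3, assuming (hypothetically) a table whose partition key is financialID, so the condition really does detect a clash:

```python
import uuid
import boto3
from botocore.exceptions import ClientError

records = boto3.resource("dynamodb").Table("financial_records")   # placeholder table

def put_with_unique_id(item):
    for _ in range(3):   # a collision is extremely unlikely, so a few retries suffice
        item["financialID"] = str(uuid.uuid4())
        try:
            records.put_item(
                Item=item,
                # Fails if an item with this financialID already exists.
                ConditionExpression="attribute_not_exists(financialID)",
            )
            return item["financialID"]
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
    raise RuntimeError("could not generate a unique financialID")
```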
Firstly, incrementing is not a good idea in DynamoDB, but the following can be a workaround:
You have to query with the equal-to operator on the hash key, say p_identifier = 101, and set ScanIndexForward to false (which sorts the results descending on your range key). Take that first item and increment its key.
If you don't know p_identifier, then you need to scan (which is not recommended), manually find the largest key, and increment it.
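When p_identifier is known, the descending query could look roughly like this (table and index names are made up):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")   # placeholder

resp = table.query(
    IndexName="p_identifier-financialID-index",       # hypothetical GSI name
    KeyConditionExpression=Key("p_identifier").eq(101),
    ScanIndexForward=False,   # descending by the range key, i.e. financialID
    Limit=1,                  # only the item with the largest financialID
)
max_financial_id = int(resp["Items"][0]["financialID"]) if resp["Items"] else 100000
```

Keep in mind that read-then-increment is racy if several writers do it concurrently, which is one more reason to prefer the atomic counter above.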
I'm wondering what the best way is to set up the keys for a table holding activity stream data. Each activity type will have different attributes (with some common ones). Here is an example of what some items will consist of:
A follow activity:
type
user_id
timestamp
follower_user_id
followee_user_id
A comment activity:
type
user_id
timestamp
comment_id
commenter_user_id
commented_user_id
For displaying the stream I will be querying against user_id and ordering by timestamp. There will also be other types of queries; for example, I will occasionally need to query on user_id AND type, as well as on attributes like comment_id, follower_user_id, etc.
So my questions are:
Should my primary key be a hash and range key using user_id and timestamp?
Do I need secondary indexes for every other attribute, or will results return quickly enough without an index? Secondary indexes are limited to 5, which wouldn't be enough for all the types of queries I will need to perform.
I'd consider whether you could segment the data into two (or more) tables, allowing better use of your queries. Combine the two as (and if) needed, i.e. your type becomes your table rather than a discriminator column like you would use in SQL.
If you don't separate the tables, then my answers would be:
Yes. I think that would be the best bet, given that it seems like that is the way you will be using it most of the time.
No. But you do need to consider what the most frequent queries are and the performance considerations around them. Which ones need to be performant, and which ones are merely "good enough"?
A combination of caching and asynchronous processing can allow a slow-performing scan to be good enough, but it doesn't eliminate the need for some local secondary indexes.
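For the non-segmented variant, the main stream read with user_id as the hash key and timestamp as the range key would be a plain Query; a sketch with boto3 (table name and page size are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

activities = boto3.resource("dynamodb").Table("activities")   # placeholder table

resp = activities.query(
    KeyConditionExpression=Key("user_id").eq("user-123"),
    ScanIndexForward=False,   # newest activity first
    Limit=20,                 # one page of the stream
)
stream_page = resp["Items"]
```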
I'm looking at Amazon's DynamoDB as it looks like it takes away all of the hassle of maintaining and scaling your database server. I'm currently using MySQL, and maintaining and scaling the database is a complete headache.
I've gone through the documentation and I'm having a hard time trying to wrap my head around how you would structure your data so it could be easily retrieved.
I'm totally new to NoSQL and non-relational databases.
From the DynamoDB documentation it sounds like you can only query a table on the primary hash key, plus the primary range key with a limited set of comparison operators.
Or you can run a full table scan and apply a filter to it. The catch is that a scan only reads 1 MB at a time, so you'd likely have to repeat the scan to find X results.
I realize these limitations allow them to provide predictable performance, but they seem to make it really difficult to get your data out. And performing full table scans seems really inefficient, and would only become less efficient over time as your table grows.
For Instance, say I have a Flickr clone. My Images table might look something like:
Image ID (Number, Primary Hash Key)
Date Added (Number, Primary Range Key)
User ID (String)
Tags (String Set)
etc
So using Query I would be able to list all images from the last 7 days and limit it to X results pretty easily.
But if I wanted to list all images from a particular user I would need to do a full table scan and filter by user ID. The same would go for tags.
And because you can only scan 1 MB at a time, you may need to do multiple scans to find X images. I also don't see a way to easily stop at X images: if you're trying to grab 30 images, your first scan might find 5, and your second might find 40.
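The kind of extra paging logic I mean would look roughly like this with boto3 (table and attribute names are invented for the sketch):

```python
import boto3
from boto3.dynamodb.conditions import Attr

images = boto3.resource("dynamodb").Table("Images")   # placeholder table

def images_for_user(user_id, want=30):
    found, kwargs = [], {"FilterExpression": Attr("UserID").eq(user_id)}
    while len(found) < want:
        resp = images.scan(**kwargs)
        found.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            break   # the whole table has been scanned
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
    return found[:want]
```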
Do I have this right? Is it basically a trade-off? You get really fast predictable database performance that is virtually maintenance free. But the trade-off is that you need to build way more logic to deal with the results?
Or am I totally off base here?
Yes, you are correct about the trade-off between performance and query flexibility.
But there are a few tricks to reduce the pain - secondary indexes/denormalising probably being the most important.
You would have another table keyed on user ID, listing all their images, for example. When you add an image, you update this table as well as adding a row to the table keyed on image ID.
You have to decide what queries you need, then design the data model around them.
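A rough sketch of that double write with boto3 (table and attribute names are made up, and a real implementation would need to handle a failure between the two puts):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
images = dynamodb.Table("Images")                 # main table, keyed on ImageID (+ DateAdded)
images_by_user = dynamodb.Table("ImagesByUser")   # "index" table, keyed on UserID + DateAdded

def add_image(image_id, user_id, date_added, tags):
    # Full item goes into the main table.
    images.put_item(Item={
        "ImageID": image_id, "DateAdded": date_added,
        "UserID": user_id, "Tags": tags,
    })
    # Only the lookup data goes into the per-user table.
    images_by_user.put_item(Item={
        "UserID": user_id, "DateAdded": date_added, "ImageID": image_id,
    })
```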
I think you need to create your own secondary index, using another table.
This table's "schema" could be:
User ID (String, Primary Key)
Date Added (Number, Range Key)
Image ID (Number)
That way you can query by User ID and filter by Date as well.
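For example (table name and epoch timestamps below are placeholders), reading one user's images for a date window becomes a single Query on that table:

```python
import boto3
from boto3.dynamodb.conditions import Key

images_by_user = boto3.resource("dynamodb").Table("ImagesByUser")   # placeholder

resp = images_by_user.query(
    KeyConditionExpression=(
        Key("UserID").eq("user-123")
        & Key("DateAdded").between(1672531200, 1675209600)   # any date window
    ),
    ScanIndexForward=False,   # newest first
)
image_ids = [item["ImageID"] for item in resp["Items"]]
```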
You can use a composite hash-range key as the primary key.
From the DynamoDB page:
A primary key can either be a single-attribute hash key or a composite hash-range key. A single attribute hash primary key could be, for example, “UserID”. This would allow you to quickly read and write data for an item associated with a given user ID.
A composite hash-range key is indexed as a hash key element and a range key element. This multi-part key maintains a hierarchy between the first and second element values. For example, a composite hash-range key could be a combination of “UserID” (hash) and “Timestamp” (range). Holding the hash key element constant, you can search across the range key element to retrieve items. This would allow you to use the Query API to, for example, retrieve all items for a single UserID across a range of timestamps.
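The documentation's last example, all items for one UserID across a range of timestamps, is a single Query call. A sketch with the low-level boto3 client (table name and values are illustrative; Timestamp is a DynamoDB reserved word, hence the alias):

```python
import boto3

client = boto3.client("dynamodb")

resp = client.query(
    TableName="UserEvents",   # placeholder table with UserID (hash) + Timestamp (range)
    KeyConditionExpression="UserID = :uid AND #ts BETWEEN :start AND :end",
    ExpressionAttributeNames={"#ts": "Timestamp"},   # "Timestamp" is reserved, so alias it
    ExpressionAttributeValues={
        ":uid":   {"S": "user-123"},
        ":start": {"N": "1672531200"},
        ":end":   {"N": "1675209600"},
    },
)
items = resp["Items"]
```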
Typically, databases are designed as below to allow multiple types for an entity:
Entity Name
Type
Additional info
Entity name can be something like an account number, and type could be something like savings, current, etc. in a bank database, for example.
Mostly, type will be some kind of string. There could be additional information associated with an entity type.
Normally, queries will be posed like this:
Find account numbers of a particular type?
Find account numbers of type X, having a balance greater than 1 million?
To answer these queries, the query analyzer will scan the index if an index is associated with that particular column. Otherwise, it will do a full scan of all the rows.
I am thinking about the optimization below.
Why not store a hash or integral value of each column's data in the actual table, such that the ordering property is maintained, so that comparison is easy?
It has the advantages below:
1. Table size will be a lot less, because we will be storing small values for each column's data.
2. We can construct a clustered B+ tree index on the hash values of each column to retrieve the rows matching, greater than, or smaller than some value.
3. The corresponding values can be easily retrieved by keeping the B+ tree index in main memory and looking up the corresponding rows.
4. Infrequent values will never need to be retrieved.
I still have more optimizations in mind. I will post those based on the feedback to this question.
I am not sure whether this is already implemented in a database; this is just a thought.
Thank you for reading this.
-- Bala
Update:
I am not trying to emulate what the database does. Normally, indexes are created by the database administrator. I am proposing a physical schema that has indexes on all the fields in the database, so that the database table size is reduced and it is easy to answer a few queries.
Update (Joe's answer):
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
In a typical table, all the physical data will be stored. But now, by generating a hash value for each column's data, I am only storing the hash value in the actual table. I agree that it's not reducing the size of the database, but it is reducing the size of the table. It will be useful when you don't need to return all of the column values.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
There can be only one clustered index on a table, and all other indexes have to be unclustered indexes. With my approach I will have a clustered index on all the values of the database. It will improve query performance.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
The basic idea is that instead of creating a separate table for each column for efficient access, we are doing it at the physical level.
Now the table will look like this.
Row1 - OrderedHash(Column1),OrderedHash(Column2),OrderedHash(Column3)
Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.
This is mainly useful when you need to index a column which contains long values, e.g. varchars that are, on average, more than 100 characters or so.
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
I don't think your approach is very helpful.
Hash values only help with equality/inequality comparisons, not with less-than/greater-than comparisons, unlike pretty much every database index.
Even for (in)equality, hash functions do not offer a 100% guarantee of having given you the right answer, as hash collisions can happen, so you will still have to fetch and compare the original value. Boom, you just lost what you wanted to save.
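A tiny, database-agnostic Python illustration of those two points: hashing preserves equality but throws away ordering, so a range predicate cannot be answered from the hashes alone.

```python
names = ["Adams", "Baker", "Carter", "Davis"]
hashes = [hash(n) for n in names]

print(names == sorted(names))      # True: the original values are in order
print(hashes == sorted(hashes))    # almost certainly False: hash order is unrelated
# A predicate like  name > 'Baker'  therefore cannot be evaluated on the hashes,
# and even  name = 'Baker'  needs the original value re-checked in case of a collision.
```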
You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out of order.
If you don't want the database to have to work around columns you do not use in a query, then use indexes with extra data columns: if your query is ordered according to that index, and your query only uses columns that are in the index (the columns the index is based on plus the columns you have explicitly added to the index), the DBMS will not read the original table.
Etc.