I'm learning more about DynamoDB and I'm building a full serverless project using it as my main database.
The project is a simple Poll app, where a user can create and vote in polls.
The data model for a SQL database is quite simple; in fact, this is all it looks like:
But with DynamoDB I need to think about access patterns and this is where it becomes a little bit confusing for me.
After watching a few courses on DynamoDB (not specifically related to data modeling), I was able to come up with this aggregate view, following the single-table design.
This was done in the NoSQL Workbench app.
It has the same 3 entities within a single table, but I have a few questions.
These are the access patterns I need for my app, at least the ones I can think of for now:
[Put, Delete, Get] User by userId
[Get] All Users by pollId (who voted in what)
[Put, Delete, Get] Poll by pollId
[Put, Delete, Get] All Polls by userId
[Put, Delete, Get] All Polls by visibility
[Create, Get] Poll Votes by pollId
Are Global Secondary Indexes all I need to satisfy these access patterns?
Are there any pitfalls in the way that I designed the data?
Are there any other tips you might recommend?
You are correct that a GSI will help you satisfy these access patterns, but your NoSQL Workbench screenshot doesn't show GSIs. Instead, it shows a batch write (or transact write) of two items for each vote: (1) pk: userId, sk: pollId; (2) pk: pollId, sk: userId.
This is less efficient if you want to enforce a rule that no user can vote in the same poll twice. To enforce that with the current key structure, you would need the TransactWriteItems API call, which is more expensive.
Instead, use a single write per vote, keep the attributes needed for condition checks in the pk or sk, and use a GSI to offer the alternative access patterns (a sketch follows the pattern below).
Alternative Pattern:
Model: Vote
pk: pollId
sk: userId
gsi1pk: userId
gsi1sk: pollId
Model: User
pk: userId
sk: userId
Model: Poll
pk: pollId
sk: pollId
This has a tradeoff in that either 'Get all Polls (votes) by userId' or 'Get all Users by pollId' will require 2 read operations. Choose the Vote pk and sk to favour the read operation which will see more traffic.
This design also suggests you may want to denormalise some data from the Poll model to the Vote model.
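For illustration, here is a minimal sketch of that single-write vote using boto3 (the table name "polls-app" and the attribute names beyond the keys are my assumptions). A conditional put rejects a second vote by the same user in the same poll without needing a transaction:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("polls-app")  # hypothetical name

def cast_vote(poll_id: str, user_id: str, option: str) -> bool:
    """Write one Vote item; fail if this user already voted in this poll."""
    try:
        table.put_item(
            Item={
                "pk": poll_id,      # Vote: pk = pollId
                "sk": user_id,      # Vote: sk = userId
                "gsi1pk": user_id,  # GSI: all polls a user voted in
                "gsi1sk": poll_id,
                "option": option,
            },
            # The pk/sk pair is the vote's identity, so a plain conditional
            # put (1 WRU + 1 WRU for the GSI copy) enforces one vote per user.
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # user already voted in this poll
        raise

Querying gsi1 with gsi1pk = userId then returns every poll the user voted in.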
If your application is massively read-heavy, your original model could be the correct choice. But consider that a query can read an extra data item for 0.5 RRU, and 1 on-demand RRU costs $0.25/million, whereas a transact write costs 2 WRU per item, and 1 on-demand WRU costs $1.25/million. This implies that unless your query 'Get all polls for a user' gets at least 20x the traffic of 'create vote', it is better to use a GSI (a GSI costs an extra 1 WRU, but 3 transact write items cost 6 WRU).
Even using a read followed by a batch write to enforce uniqueness still consumes an extra 1 WRU + 1 RRU for each vote.
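To make the break-even arithmetic concrete, here is one consistent reading of the numbers above (the assumption that the transactional design duplicates the vote across 2 items, i.e. 4 WRU per vote, is mine; verify against current AWS pricing):

RRU_PRICE = 0.25 / 1_000_000  # $ per on-demand read request unit
WRU_PRICE = 1.25 / 1_000_000  # $ per on-demand write request unit

# GSI design: 1 WRU for the vote item + 1 WRU for its GSI copy.
gsi_write = 2 * WRU_PRICE
# Transactional design: each transact-write item costs 2 WRU; with the
# vote duplicated across 2 items (one per key layout), that is 4 WRU.
transact_write = 2 * 2 * WRU_PRICE

# The GSI design pays an extra 0.5 RRU on one of the read patterns.
extra_read = 0.5 * RRU_PRICE

# Queries per vote at which the two designs cost the same.
breakeven = (transact_write - gsi_write) / extra_read
print(f"break-even: {breakeven:.0f} queries per vote")  # -> 20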
We have 3 models: Users, which can have Orders, which can have Transactions (like credit card transactions).
Does it make sense to store the user_id on transactions, or is that considered bad practice because of duplication of data? Assume that transactions are always read in the context of an order (as in, in our fictional app we can never just see all previous transactions, only all previous orders with their transactions). Are there any issues that might arise in the future? I'm not worried about database size (especially since this is just 1 integer column) but am thinking about data and code complexity.
We could have
Model 1: Users
- has many Orders
Model 2: Orders
- belongs to Users
- has many Transactions
Model 3: Transactions
- belongs to Orders
or we could have
Model 1: Users
- has many Orders
- has many Transactions
Model 2: Orders
- belongs to Users
- has many Transactions
Model 3: Transactions
- belongs to Orders
- belongs to Users
Your first example is the standard way of doing it. Use that unless there is a good reason not to.
The problem with the second one is that it adds complexity to your code, and there is a chance that data may become inconsistent.
The benefit of the second one is that it is faster to get all transactions belonging to a user (assuming you have an index).
Adding the index will, however, make inserts slower.
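For illustration, the Model 1 query shape, where a user's transactions are reached through orders, might look like this (SQLite, with an illustrative schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users        (id INTEGER PRIMARY KEY);
CREATE TABLE orders       (id INTEGER PRIMARY KEY,
                           user_id INTEGER NOT NULL REFERENCES users(id));
CREATE TABLE transactions (id INTEGER PRIMARY KEY,
                           order_id INTEGER NOT NULL REFERENCES orders(id));
""")

# Model 1: reach a user's transactions through orders (one extra join).
txns = conn.execute("""
    SELECT t.id
    FROM transactions t
    JOIN orders o ON o.id = t.order_id
    WHERE o.user_id = ?
""", (42,)).fetchall()

An index on orders(user_id) and transactions(order_id) keeps this join cheap, which is usually enough to avoid denormalizing user_id onto transactions.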
Introduction
Hello, I'm moving to AWS because of stability, performance, etc. I will be using DynamoDB because of the always free tier that allows me to reduce my bills a lot. I was using MySQL until now. I will make the attributes simple for this example (to show the actual places where I need help and make the question shorter).
My actual DB has fewer than 5k rows, and I expect it to grow to 20-30k in 2 years. Each user (without any group/order data) is around 600 bytes. I don't know how this will translate to a NoSQL DB, but I expect it to be less than 10 MB.
What data will I have?
User:
username
password
is_group_member
Group:
capacity
access_level
Order:
oid
status
prod_id
Relationships:
User has many orders.
Group has many users.
How will I access the data and what will I get?
I will access the user by username (I won't know the group he is in). I will need to get the user's data, the group he belongs to and its data.
I will access the users that belong to a certain group. I will need to get the users' data and the group data.
I will access an order by its oid. I will need to get the user it belongs to and its data.
What I tried
I watched a series of videos by Gary Jennings, read answers on SO and also read alexdebrie's article about one-to-many relationships. My problem is that I can't seem to find an alternative that suits all the ways I will access the data.
For example:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
Composite primary key: I will be able to access the users by their group, but how will I access the user and the group's data without knowing the group beforehand? I would need to use 2 requests, making it inefficient and increasing the costs.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
Final questions
Did I misunderstand the alternatives? I started this question because my knowledge is not "big enough" to actually know if there is a better alternative that I can't think of, so maybe I got the explanations wrong.
Is there a better alternative for this case?
If there isn't a better alternative, what would you do in my case? Would you duplicate the group's data (thus increasing the used space but needing only 1 request)? Or would you use one of the other 2 alternatives and use 2 requests?
You're off to a great start by articulating your access patterns ahead of time.
Let's start by addressing some of your comments about data modeling in DynamoDB:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
When first learning DynamoDB data modeling, prior SQL Database knowledge can really get in the way. Normalizing your data is a common practice when working with SQL databases. However, denormalizing your data is a key data modeling strategy in DynamoDB.
One BIG reason you want to denormalize your data: DynamoDB doesn't have joins. Because DDB doesn't have joins, you'll be well served to pre-join your data so it can be fetched in a single query.
This blog post does a good job of explaining why denormalization is important in DDB.
Keep in mind, storage is cheap. Denormalizing your data makes for faster data access at a relatively low cost. With the size of your database, you will likely be well under the free tier threshold. Don't stress about the duplicate data!
Composite primary key: I will be able to access the users by their group, but how will I access the user and the group's data without knowing the group beforehand? I would need to use 2 requests, making it inefficient and increasing the costs.
Denormalizing your data will help solve this problem (e.g. store the group info with the user). I'll give you an example of this below.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
You didn't share your primary key structure, so I'm not sure what scenario will require two requests. However, I will say that there may be certain situations where making two requests to DDB is a reasonable approach. Making two efficient query operations is not the end of the world.
OK, on to an example of modeling your relationships! Keep in mind that there are many ways to model data in DynamoDB. This example is not THE way. Rather, it's an example meant to demonstrate a few strategies that might help.
Here's one take on your data model:
With this arrangement, you can support the following access patterns (query sketches follow the list):
Fetch user information - PK = USER#[username] SK = USER#[username]
Fetch user group - PK = USER#[username] SK begins_with GROUP#. Notice I denormalized user data in the group item. The reason for this will be apparent shortly :)
Fetch user orders - PK = USER#[username] SK begins_with ORDER#
Fetch all user data - PK = USER#[username]
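As a sketch of how these four patterns translate to API calls (boto3, with a hypothetical table name and the PK/SK attribute names used above):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical name

# Fetch user information: a single-item lookup.
user = table.get_item(Key={"PK": "USER#alice", "SK": "USER#alice"})["Item"]

# Fetch user orders: a prefix query within the user's item collection.
orders = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice")
                           & Key("SK").begins_with("ORDER#")
)["Items"]

# Fetch all user data (user item, group item, and orders) in one query.
everything = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice")
)["Items"]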
To support your remaining access patterns, I created a secondary index. The primary key and sort key of the secondary index are swapped with the primary key/sort key of the base table. This pattern is called an inverted index. The secondary index looks like this:
This secondary index supports the following access patterns (sketched below):
Fetch Group users - PK = GROUP#[groupId]
Fetch Order by oid - PK = ORDER#[oid]
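Querying the inverted index might look like this (assuming the index is named GSI1; because it swaps the base table's keys, the SK attribute serves as its partition key):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical name

# Fetch Group users: every item whose base-table SK is the group.
group_users = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("SK").eq("GROUP#admins"),
)["Items"]

# Fetch an Order by oid, without knowing which user owns it.
order_items = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("SK").eq("ORDER#12345"),
)["Items"]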
You can see that I denormalized the User and Group relationship by repeating user data in the item representing the Group. This helps me with the "fetch group users" access pattern.
Again, this is just one way you can achieve the access patterns you described. There are many strategies, but many will require that you abandon some of the best practices you learned working with SQL databases!
For my graph database, I have users and transactions. Each user has an id. Each transaction has an id, date and sender and receiver. The sender and receiver of a transaction have the type User.id.
There is a sender/receiver type edge that connects users and transactions together.
I would like to query the 10 most recent transactions for a particular user, user_id, before they sent an arbitrary transaction with the id of txn_id.
How can I optimize the performance of this pagination query? I was thinking of creating an index on User.id to find user_id fast. If I index Transaction.date and Transaction.id, would it make it performant to search for transactions older than txn_id for an individual user?
If you're talking about a composite index (create index on :Transaction(date, id)) then no, as you would need to have the exact values for lookup for the indexed properties. Any other means of lookup (including range scans) won't use the composite index.
An index on just :Transaction(date) won't help either, as this wouldn't be used once you found your user in question and expanded out the transactions. Starting from transactions by date wouldn't be wise, since you would receive many that weren't associated with the user in question.
You'll need an index on :User(id) for that quick lookup of the user, then you'll just need to use ordering with a limit to find the 10 most recent for the user.
If there can be a great many transactions per user, you might consider creating a linked list between transactions in date order so they can be iterated through much faster. Keep in mind that this would require keeping order as transactions are added, and you'd need to use proper locking techniques to avoid race conditions as you add on to the transactions list.
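A sketch of the lookup-then-order approach with the official Neo4j Python driver (the connection details, the :SENDER relationship type, and the property names are my assumptions). The anchor transaction's date gives the cursor for the page; ordering plus LIMIT does the rest:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "secret"))  # hypothetical creds

# Assumes an index on :User(id) for the fast user lookup.
CYPHER = """
MATCH (u:User {id: $user_id})-[:SENDER]->(anchor:Transaction {id: $txn_id})
MATCH (u)-[:SENDER]->(t:Transaction)
WHERE t.date < anchor.date
RETURN t
ORDER BY t.date DESC
LIMIT 10
"""

with driver.session() as session:
    page = session.run(CYPHER, user_id="some-user", txn_id="some-txn").data()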
This question applies to any database table design where you would have system default items and custom user items of the same type (i.e., the user can add his own custom items/settings).
Here is an example with invoicing and payment terms. By default an invoice can have payment terms of DueOnReceipt, NET10, NET15, NET30 (this is the default for all users!), therefore you would have two tables, "INVOICE" and "PAYMENT_TERM".
INVOICE
Id
...
PaymentTermId
PAYMENT_TERM (System default)
Id
Name
Now what is the best way to allow a user to store his own custom "PaymentTerms", and why? (i.e., the user can use the system default payment terms OR his own custom payment terms that he created/added)
Option 1) Add UserId to PaymentTerm; set UserId for the user that added a custom item, and leave UserId null for system defaults.
INVOICE
Id
...
PaymentTermId
PaymentTerm
Id
Name
UserId (System Default, UserId=null)
Option 2) Add a flag "IsPaymentTermCustom" to Invoice and create a custom table "PAYMENT_TERM_CUSTOM".
INVOICE
Id
...
PaymentTermId
PaymentTermCustomId
IsPaymentTermCustom (True for custom, otherwise false for system default)
PaymentTerm
Id
Name
PAYMENT_TERM_CUSTOM
Id
Name
UserId
Now check via an SQL query whether the user is using a custom payment term: if IsPaymentTermCustom=True, the user is using a custom payment term; otherwise he is using a system default.
Option 3) ????
...
As a general rule:
Prefer adding columns to adding tables
Prefer adding rows to adding columns
Generally speaking, the considerations are:
Effects of adding a table
Requires the most changes to the app: You're supporting a new kind of "thing"
Requires more complicated SQL: You'll have to join to it somehow
May require changes to other tables to add a foreign key column referencing the new table
Impacts performance because more I/O is needed to join to and read from the new table
Note that I am not saying "never add tables". Just know the costs.
Effects of adding a column
Can be expensive to add a column if the table is large (the ALTER TABLE ADD COLUMN can take hours to complete, during which the table will be locked, effectively bringing your site "down"), but this is a one-time thing
The cost to the project is low: Easy to code/maintain
Usually requires minimal changes to the app - it's a new aspect of a thing, rather than a new thing
Will perform with negligible performance difference. Will not be measurably worse, but may be a lot faster depending on the situation (if having the new column avoids joining or expensive calculations).
Effects of adding rows
Zero: If your data model can handle your new business idea by just adding more rows, that's the best option
(Pedants kindly refrain from making comments such as "there is no such thing as 'zero' impact", or "but there will still be more disk used for more rows" etc - I'm talking about material impact to the DB/project/code)
To answer the question: Option 1 is best (i.e., add a UserId column to the payment terms table).
The reasoning is based on the guidelines above and this situation is a good fit for those guidelines.
Further,
I would also store "standard" payment options in the same table, but with a NULL userid; that way you only have to add new payment options when you really have one, rather than for every customer even if they use a standard one.
It also means your invoice table does not need changing, which is a good thing - it means minimal impact to that part of your app.
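As a sketch, the Option 1 lookup with NULL-UserId rows as the system defaults might look like this (SQLite, illustrative names):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PaymentTerm (
    Id     INTEGER PRIMARY KEY,
    Name   TEXT NOT NULL UNIQUE,
    UserId INTEGER)""")  # NULL UserId marks a system default
conn.executemany("INSERT INTO PaymentTerm (Name, UserId) VALUES (?, ?)",
                 [("DueOnReceipt", None), ("NET10", None),
                  ("NET15", None), ("NET30", None), ("NET7", 42)])

# Terms visible to user 42: system defaults plus his own custom terms.
terms = conn.execute("""
    SELECT Id, Name FROM PaymentTerm
    WHERE UserId IS NULL OR UserId = ?
""", (42,)).fetchall()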
It seems to me that there are merely "Payment Terms" and "Users". The decision of which payment terms are the defaults is a business rule, and therefore would be best represented in the business layer of your application.
Assuming that you would like to have a set of pre-defined "default" payment terms present in your application from the start, these would already be present in the payment terms table. However, I would put a reference table in between USERS and PAYMENT_TERMS (a lookup sketch follows the schema):
USERS:
user-id
user_name
USER_PAYMENT_TERMS:
userID
payment_term_id
PAYMENT_TERMS:
payment_term_id
payment_term
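Resolving a user's available terms through the reference table would then look something like this (SQLite, illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users              (user_id INTEGER PRIMARY KEY,
                                 user_name TEXT);
CREATE TABLE payment_terms      (payment_term_id INTEGER PRIMARY KEY,
                                 payment_term TEXT NOT NULL UNIQUE);
CREATE TABLE user_payment_terms (user_id INTEGER REFERENCES users,
                                 payment_term_id INTEGER
                                     REFERENCES payment_terms,
                                 PRIMARY KEY (user_id, payment_term_id));
""")

# Terms assigned to one user via the many-to-many reference table.
rows = conn.execute("""
    SELECT pt.payment_term_id, pt.payment_term
    FROM payment_terms pt
    JOIN user_payment_terms upt USING (payment_term_id)
    WHERE upt.user_id = ?
""", (1,)).fetchall()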
Your business layer should offer up to the user (or more likely, the administrator) through a GUI the ability to:
Assign 0 to many payment term options to a particular user (some users may not want one of the defaults to even be available, for example).
Add custom payment terms, which then become available for assignment to one or more users (but which avoids the creation of duplicate payment terms by different users)
Allow the definition of a custom payment term to be assigned to more than one user (say the user's company has a unique payment process which requires all of its users to utilize a payment term other than one of the defaults: create the custom term once, and assign it to all users).
Your application business layer would establish rules governing access to payment terms, which could then be accessed by your user interface.
Your UI would then (again, likely through an administrator function) allow the set up of one or more payment terms in addition to the standards you describe, and then make them available to one or more users through something like a checked list box (for example).
Option 1 is definitely better, for the following reasons:
Correctness
You can implement a database constraint for uniqueness of the payment term name
You can implement a foreign key constraint from Invoice to PaymentTerm
Ease of Use
Conducting queries will be much simpler because you will always join from Invoice to PaymentTerm, rather than requiring a more complex join. Most of the time when you select, you will not care whether it is an inbuilt or custom payment term. The optimizer will also have an easier time with a normal join than with one that depends on another column to decide which table to join.
Easier to display a list of PaymentTerms coming from one table
We use Option 1 in our data model quite a lot.
Part of the problem, as I see it, is that different payment terms lead to different calculations, too. If I were still in the welding supply business, I'd want to add "2% 10 NET 30", which would mean a 2% discount if the payment is made in full within 10 days, otherwise net 30.
Setting that issue aside, I think ownership of the payment terms makes sense. Assume that the table of users (not shown) includes the user "system" as, say, user_id 0.
create table payment_terms (
    payment_term_id integer primary key,
    payment_term_owner_id integer not null references users (user_id),
    payment_term_desc varchar(30) not null unique
);
insert into payment_terms values (1, 0, 'Net 10');
insert into payment_terms values (2, 0, 'Net 15');
...
insert into payment_terms values (5, 1, '2% 10, Net 30');
This keeps foreign keys simple, and it makes it easy to select payment terms at run time for presentation in the user interface.
Be very careful here. You probably want to store the description, not the ID number, with your invoices. (It's unique; you can set a foreign key reference to it.) If you store only the ID number, updating a user's custom description might subtly corrupt all the data that references it.
For example, let's say that the user created a custom payment term number 5, '2% 10, Net 30'. You store the ID number 5 in your table of invoices. Then the user decides that things will be different starting today, and updates that description to '2% 10, Net 20'. Now on all your past invoices, the arithmetic no longer matches the payment terms.
Your auditor will kill you. Twice.
You'll want to prevent ordinary users from deleting rows owned by the system user. There are several ways to do that.
Use a BEFORE DELETE trigger (sketched below).
Add another table with foreign key references to the rows owned by the system user.
Restrict all access through stored procedures that prevent deleting system rows.
(And flags are almost never the best idea.)
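As a sketch of the trigger option (SQLite syntax; the system owner id 0 follows the convention above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_terms (
    payment_term_id       INTEGER PRIMARY KEY,
    payment_term_owner_id INTEGER NOT NULL,
    payment_term_desc     VARCHAR(30) NOT NULL UNIQUE
);

-- Refuse to delete any row owned by the system user (user_id 0).
CREATE TRIGGER protect_system_terms
BEFORE DELETE ON payment_terms
FOR EACH ROW WHEN OLD.payment_term_owner_id = 0
BEGIN
    SELECT RAISE(ABORT, 'system payment terms cannot be deleted');
END;
""")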
Applying general rules of database design to the problem at hand:
one table for system payment terms
one table for user payment terms
a view of the union of the two above
Now you can join invoice on the view of payment terms (see the sketch below).
Benefits:
No flag columns
No nulls
You separate system defaults from user data
Things become straightforward for the DB
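A sketch of this arrangement (SQLite, illustrative names; note the two tables' id ranges must not collide if invoices reference the view's ids):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_payment_terms (id INTEGER PRIMARY KEY,
                                   name TEXT NOT NULL UNIQUE);
CREATE TABLE user_payment_terms   (id INTEGER PRIMARY KEY,
                                   name TEXT NOT NULL,
                                   user_id INTEGER NOT NULL);

-- One relation over both, so invoices join a single "payment terms" source.
CREATE VIEW all_payment_terms AS
    SELECT id, name, NULL AS user_id FROM system_payment_terms
    UNION ALL
    SELECT id, name, user_id FROM user_payment_terms;
""")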
What's the standard relational database idiom for setting permissions for items?
Answers should be general; however, they should be applicable to the example below. Anything flies: adding columns, adding another table, whatever, as long as it works well.
Application / Example
Assume the Twitter database is extremely simple: we have one User table, which contains a login and user id; we have a Tweet table, which contains a tweet id, tweet text, and creator id; and we have a Follower table, which contains the id of the person being followed and the follower.
Now, assume Twitter wants to enable advanced privacy settings (viewing permissions), so that users can pick exactly which followers can view tweets. The settings can be:
Everyone on Twitter
Only current followers (which would of course have to be approved by the user, this doesn't really matter though) EDIT: Current as in, I get a new follower, he sees it; I remove a follower, he stops seeing it.
Specific followers (e.g., user id 5, 10, 234, and 1)
Only the owner
Under these circumstances, what's the best way to represent viewing permissions? The priorities, in order, are speed of lookup (you want to be able to figure out what tweets to display to a user quickly), speed of creation (you don't want to take forever to post a tweet), and efficient use of space (every time I post a tweet to everyone on my followers' list, I shouldn't have to add a row for each and every follower I have to some table.)
Looks like a typical many-to-many relationship. I don't see any restrictions in what you desire that would allow space savings over the typical relational DB idiom for those, i.e., a table with two columns (both foreign keys, one into users and one into tweets). Since the current followers can and do change all the time, posting a tweet to all the followers that are current at the instant of posting (I assume that's what you mean?) does mean adding that many (extremely short) rows to that relationship table. (The alternative, keeping a timestamped history of follower sets so you can reconstruct who was a follower at any given tweet-posting time, appears definitely worse in time and not substantially better in space.)
If, on the other hand, you want to check followers at the time of viewing (rather than at the time of posting), then you could make a special userid artificially meaning "all followers of the current user" (just like you'll have one meaning "all users on Twitter"); the needed SQL to make the lookup fast, in that case, looks hairy but feasible (a UNION or OR with "all tweets for which I'm a follower of the author and the tweet is readable by [the artificial userid representing] all followers"). I'm not getting deep into that maze of SQL until and unless you confirm that it is this peculiar meaning that you have in mind (rather than the simple one which seems more natural to me but doesn't allow any space savings on the relationship table for the action of "post tweet to all followers").
Edit: the OP has clarified they mean the approach I mention in the second paragraph.
Then, assume userid is the primary key of the Users table; the Tweets table has a primary key tweetid and a foreign key author for the userid of each tweet's author; the Followers table is a typical many-to-many relationship table with two columns (both foreign keys into Users), follower and followee; and the Canread table is a not-so-typical many-to-many relationship table, still with two columns: foreign key into Users is column reader, foreign key into Tweets is column tweet (phew ;-). Two special users #everybody and #allfollowers are defined with the above meanings (so that posting to everybody, to all followers, or "just to myself" each adds only one row to Canread; only selective posting to a specific list of N people adds N rows).
So the SQL for the set of tweet IDs a user #me can read is, I think, something like:
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
WHERE Canread.reader IN (#me, #everybody)
UNION
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
JOIN Followers ON(Tweets.author=Followers.followee)
WHERE Canread.reader=#allfollowers
AND Followers.follower=#me
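And on the write side, the point of the two artificial users is that posting to everybody, to all followers, or to only yourself each adds a single Canread row; only selective posting adds N rows. A minimal sketch (SQLite; the sentinel ids are an assumed convention):

import sqlite3

EVERYBODY, ALLFOLLOWERS = -1, -2   # artificial user ids

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Canread (reader INTEGER NOT NULL, "
             "tweet INTEGER NOT NULL, PRIMARY KEY (reader, tweet))")

def post_tweet(tweet_id, audience):
    """audience: EVERYBODY, ALLFOLLOWERS, the author's own id, or a list."""
    readers = audience if isinstance(audience, list) else [audience]
    conn.executemany("INSERT INTO Canread (reader, tweet) VALUES (?, ?)",
                     [(r, tweet_id) for r in readers])

post_tweet(1, EVERYBODY)         # 1 row
post_tweet(2, ALLFOLLOWERS)      # 1 row
post_tweet(3, [5, 10, 234, 1])   # N rows for a specific list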