Design scenario of a DynamoDB table - database

I am new to DynamoDB and after reading several docs, there is a scenario in which I am not sure which would be the best approach for designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need to use a primary key. Because we need to support a search like "get all JobOffers without any filters sorted somehow", which would be a good candidate for partition key?
As a workaround, I was thinking to use a new attribute "country" which can be used as the partition key, but since all JobOffers are specified in one country, all these items fall under the same partition, so it might be a bit redundant until we will have support for JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?

Design of a DynamoDB table is best done by starting from your access patterns, that is, how you are going to be accessing the data in it. You have information X, you need Y.
Also remember that DynamoDB is NOT SQL, and it is not required that every item has the same shape. Consider each entry a document, with its PK/SK acting almost like a directory/item structure in a file system.
So for example:
You have user data. You have things like: Avatar data (image name, image size, image type), Login data (salt/pepper hashes, email address, username), Post history (post title, identifier, content, replies). Each user will only have 1 Avatar item and 1 Login item, but will have many Post items.
You know that from the user page you are always going to have the user ID. 100% of the time. This should then be your PK - your hash key, or partition key. The rest of the things you need then inform your sort key/range key.
PK            SK         Attributes
USER#123456   AVATAR     (image name, image size, image type)
USER#123456   LOGIN      (salt/pepper hashes, email address, username)
USER#123456   POST#123   (post title, identifier, content, replies)
USER#123456   POST#125   (post title, identifier, content, replies)
USER#123456   POST#193   (post title, identifier, content, replies)
This way you can do a query with the User ID and get ALL the user data back. Or if you are on a page that just displays their post history, you can do a query against the User ID with SK begins with POST and get back all their posts.
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something).
The idea here is that you have X bits of information, and you need to craft your table to be able to retrieve as much as possible with just that information, and by using prefixes on your SKs you can then narrow the results a little.
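To make the PK plus SK-prefix idea concrete, here is a minimal in-memory sketch. It is not the DynamoDB API: `table`, `put_item` and `query` are hypothetical stand-ins that only mimic how items in one partition are kept ordered by sort key and filtered by a prefix.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for one table:
# partition key -> list of (sort key, attributes), kept ordered by sort key.
table = {}

def put_item(pk, sk, attrs):
    """Insert an item, keeping the partition ordered by sort key (no overwrite handling)."""
    items = table.setdefault(pk, [])
    keys = [k for k, _ in items]
    items.insert(bisect_left(keys, sk), (sk, attrs))

def query(pk, sk_prefix=""):
    """Return the items of one partition whose sort key starts with sk_prefix."""
    return [(sk, attrs) for sk, attrs in table.get(pk, []) if sk.startswith(sk_prefix)]

put_item("USER#123456", "POST#193", {"title": "Third post"})
put_item("USER#123456", "AVATAR", {"image_name": "me.png"})
put_item("USER#123456", "LOGIN", {"username": "alice"})
put_item("USER#123456", "POST#123", {"title": "First post"})

everything = query("USER#123456")       # all four items for the user
posts = query("USER#123456", "POST#")   # only the post items
print([sk for sk, _ in posts])          # ['POST#123', 'POST#193']
```

The point is only that one known value (the user ID) plus a prefix is enough to pull back a whole family of items, already in sort key order.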
Note!
Sometimes this means a duplication of information! You may have the same information under two sets of keys. This is OK and kind of expected when you start getting into really complex relationships. You can mitigate it somewhat with indexes, but you should aim to avoid them where possible as they do introduce a bit of lag in terms of data propagation (it's tiny, but it can add up).
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first 3 with a company PK identifier and a reverse index. That will let you look up all of a company's jobs or, using the reverse index, a specific job. Or if you always know the company when looking up a specific job, you can use the table's primary key directly.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort value to the Job# key - sort keys are inherently sorted after all. Company# - Job#1234#UNITED_STATES
Of course this will only work for one sort at a time. You can make more than one index, but again - data sync lag is a real possibility.
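As a rough sketch of putting a sort value into the key: sort keys that are strings compare character by character, so a number like the wage has to be zero-padded for string order to match numeric order. The `WAGE#...#JOB#...` layout and the `wage_sort_key` helper are assumptions for illustration, and it assumes non-negative integer wages.

```python
def wage_sort_key(wage, job_id, width=9):
    """Build a sort key whose string order matches numeric wage order."""
    return f"WAGE#{wage:0{width}d}#JOB#{job_id}"

keys = [
    wage_sort_key(52000, "1234"),
    wage_sort_key(9000, "77"),
    wage_sort_key(120000, "9"),
]

# Padded keys sort by wage, lowest first.
print(sorted(keys))

# Unpadded numbers sorted as strings come out in the wrong order.
naive = sorted(["9000", "52000", "120000"])
print(naive)   # ['120000', '52000', '9000']
```

A query that reads the key in descending order would then return the highest wage first. One key layout supports one sort, which is why each extra sort criterion needs its own index.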
But how to do this regardless of Company? Well you can have another index with your searchable attribute (Country for example) as the PK then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just have two items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_states - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
Your reverse index here would apply - you could do a query on PK: Country#United_states and you'd get back:
Country#United_states - Company#1234 - job#321, job#456, job#1234
Country#United_states - Company#4556
Country#United_states - Company#8322
This isn't a relational database however! So you have to do one of two things: either use those job#s to then query that company and filter the jobs by what you want (bad - trying to avoid multiple queries!), or make each job# an attribute on the country SKs, containing a copy of the relevant data in a map format {job title, job#, country, company, salary}. Then when they click on that job to go to the details, it makes a direct call straight to the job query, gets the details to display, and it's good.
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need fast?

Related

Amazon DynamoDB Single Table Design For Blog Application

New to this community. I need some help in designing the Amazon Dynamo DB table for my personal projects.
Overview, this is a simple photo gallery application with following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operation for 3 and 4 (Newsfeed). I know that without a partition key I cannot query, and scan is not an effective solution. Any workaround for operations 3 and 4?
Any idea on how should I design my DB ?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it a value of POSTS, and use the upload_time field as the sort key. That would look something like this:
Viewing the secondary index (I've named it GSI1), your data would look like this:
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
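The "query this month, then fall back to earlier months" loop can be sketched like this. The `gsi1` dict is a hypothetical stand-in for the secondary index (partition key to posts, assumed to be already ordered newest first), and `month_keys` and `recent_posts` are illustrative helper names, not part of any SDK.

```python
from datetime import date

def month_keys(start, limit):
    """Yield up to `limit` partition keys, newest month first."""
    year, month = start.year, start.month
    for _ in range(limit):
        yield f"POSTS#{year:04d}-{month:02d}-00"
        month -= 1
        if month == 0:
            year, month = year - 1, 12

def recent_posts(index, today, n, max_months=12):
    """Collect the n newest posts by querying one month partition at a time."""
    found = []
    for key in month_keys(today, max_months):
        # Each partition is assumed to be stored newest-first, as a
        # descending query on the upload-time sort key would return it.
        found.extend(index.get(key, []))
        if len(found) >= n:
            break
    return found[:n]

gsi1 = {
    "POSTS#2021-01-00": ["post-9", "post-8"],
    "POSTS#2020-12-00": ["post-7", "post-6", "post-5"],
}
print(recent_posts(gsi1, date(2021, 1, 15), 4))
# ['post-9', 'post-8', 'post-7', 'post-6']
```

The `max_months` cap is an assumption added so the loop stops on a quiet feed instead of walking back forever.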
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend on introducing a data range on the number of likes (e.g. most popular posts this week/month/year/etc) you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern where you are fetching the most recent posts for a user. If you were to use a KSUID for the Post ID, your table would look like this:
I've replaced the POST ID's in this example with KSUIDs. Because the KSUIDs are unique and sortable by the time they were created, you are able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
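To show why such IDs sort by creation time, here is a simplified KSUID-like sketch. It is not the real KSUID format (real KSUIDs use their own epoch and text encoding, so use a proper library in practice); `sortable_id` is a hypothetical helper that just puts a big-endian timestamp in front of random bytes.

```python
import os

def sortable_id(timestamp):
    """Return a hex ID that sorts by creation time (KSUID-like, not a real KSUID)."""
    # 4 bytes of seconds, big-endian, so that byte order matches time order.
    prefix = int(timestamp).to_bytes(4, "big")
    # 16 random bytes keep IDs created in the same second distinct.
    return (prefix + os.urandom(16)).hex()

# IDs created at three different times, deliberately out of order.
ids = [sortable_id(t) for t in (1610000300, 1610000100, 1610000200)]

# Sorting the strings recovers creation order: oldest first.
ordered = sorted(ids)
print(ordered == [ids[1], ids[2], ids[0]])   # True
```

Used as the Post ID in the sort key, IDs like these let a descending query return a user's newest posts first without a separate timestamp index.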
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the Partition Key for the GSI and use the attribute UploadTime as the Sort Key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take time until your like counters are updated.
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10GB of post-data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a Single Table Design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the attribute values into these attributes. This will make it less confusing if you store different entities in the table. Adding a type attribute that holds the entity type is also common.
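As a sketch of the generic attribute names, a post item could be shaped like this. The helper names (`post_item`, `top_by`) and the sample values are made up for illustration, and `top_by` only imitates in memory what a descending query on each GSI would return.

```python
def post_item(user_id, post_id, upload_time, likes):
    """Build one post item with generic key attributes for the table and two GSIs."""
    return {
        "PK": f"USER#{user_id}", "SK": f"POST#{post_id}",
        "GSI1PK": "post", "GSI1SK": upload_time,   # newsfeed by recency
        "GSI2PK": "post", "GSI2SK": likes,         # newsfeed by likes
        "type": "post",
    }

items = [
    post_item("u1", "p1", "2021-01-03T10:00:00Z", 5),
    post_item("u2", "p2", "2021-01-04T09:30:00Z", 42),
    post_item("u1", "p3", "2021-01-05T18:45:00Z", 17),
]

def top_by(items, pk_attr, sk_attr, pk_value, n):
    """Imitate a descending index query: match the partition, sort by the sort key."""
    matching = [i for i in items if i[pk_attr] == pk_value]
    return sorted(matching, key=lambda i: i[sk_attr], reverse=True)[:n]

most_recent = top_by(items, "GSI1PK", "GSI1SK", "post", 2)
most_liked = top_by(items, "GSI2PK", "GSI2SK", "post", 2)
print([i["SK"] for i in most_recent])   # ['POST#p3', 'POST#p2']
print([i["SK"] for i in most_liked])    # ['POST#p2', 'POST#p3']
```

ISO 8601 timestamps in the same time zone sort correctly as strings, which is why the upload time can be used directly as a sort key here.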

Cassandra DB structure suggestion (two tables vs one)

I am new to Cassandra's DB, and I'm creating a database structure and I wonder whether my pick is optimal for my requirements.
I need to get information on unique users, and each unique user will have multiple page views.
My two options are de-normalizing my data into one big table, or create two different tables.
My most used queries would be:
Searching for all of the pages with a certain cookie_id.
Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
My two suggested schemas are these:
Two tables - One for users and one for visited pages.
This would allow me to save a new user on a different table, and keep every recurring request in another table.
CREATE TABLE user_tracker.pages (
cookie_id uuid,
created_uuid timeuuid,
data_type_3 text,
data_type_4 text,
PRIMARY KEY (cookie_id, created_uuid)
);
CREATE TABLE user_tracker.users (
cookie_id uuid,
client_id id,
data_type_1 text,
data_type_2 text,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
This data is normalized as I don't enter the user's data for every request.
One table - For all of the data saved, and the first request as the key. First request would have data_type_1 and data_type_2.
I could also save "data_type_1" and "data_type_2" as a hashmap, as they represent multiple columns and they will always be in a single data set (is it considered to be better?).
The column "first" would be a secondary index.
CREATE TABLE user_tracker.users_pages (
cookie_id uuid,
client_id id,
data_type_1 text,
data_type_2 text,
data_type_3 text,
data_type_4 text,
first boolean,
created_uuid timeuuid,
PRIMARY KEY (cookie_id, client_id, created_uuid)
);
In reality we have more columns than 4, this was written briefly.
As far as I understand Cassandra's best practices, I'm leaning into option #2, but please do enlighten me :-)
Data modelling in Cassandra is done based on the type of queries you will be making. You should be able to query using the partition key.
For the following queries, the option 2 you mentioned is okay:
1. Searching for all of the pages with a certain cookie_id.
2. Searching for all of the pages with a certain cookie_id and a client_id. If a cookie doesn't have a client, it would be marked client_id=0 (that would be most of the data).
For the third query:
3. Find the first cookie_id with extra data (for example data_type_1 + data_type_2).
I'm not sure what you mean by the first cookie_id with extra data. In Cassandra, all data is stored by partition key, and partitions are not sorted. So all your data will be stored using cookie_id as the partition key, and all future instances with the same cookie_id will keep being added to the same partition.

Getting records structured the same way only partially

While surfing through 9gag.com, an idea (problem) came to my mind. Let's say that I want to create a website where users can add different kinds of entries. Now each entry is a different type and needs different / additional columns.
Let's say that we can add:
a youtube video
a cite which requires the cite's author name and last name
a flash game which requires additional game category, description, genre etc.
an image which requires the link
Now all of the above are entries and have some columns in common (like id, add_date, adding_user_id, etc...) and some different / additional ones (for example: only a flash game needs a description, or only an image needs a plus_18 column to be specified). The question is how should I organize the DB / code for controlling all of the above as entries together? I might want to order them, or search entries by add_date etc...
The ideas that came to my mind:
Add a "type" column which specifies what kind of entry it is, and add all the possible columns, allowing NULL for the columns that do not apply to a particular type. But this is mega nasty. There is no data integrity.
Add some column with serialized data for the additional data but it makes any filtration a total hell.
Create a master (parent) table for an entry and separate tables for concrete entry types (their additional columns / info). But here I don't even know how I'm supposed to select data properly and is just nasty as well.
So what's the best way to solve this problem?
The parent table seems like the best option.
// This is the parent table
Entry
    ID PK
    Common fields
Video
    ID PK
    EntryID FK
    Unique fields
Game
    ID PK
    EntryID FK
    Unique fields
...
What the queries will look like will largely depend on the type of query. To, for example, get all games ordered by a certain date, the query will look something like:
SELECT *
FROM Game
JOIN Entry ON Game.EntryID = Entry.ID
ORDER BY Entry.AddDate
To get all content ordered by date, will be somewhat messy. For example:
SELECT *
FROM Entry
LEFT JOIN Game ON Game.EntryID = Entry.ID
LEFT JOIN Video ON Video.EntryID = Entry.ID
...
ORDER BY Entry.AddDate
If you want to run queries like the one above, I suggest you give unique names to your primary key fields (i.e. VideoID and GameID) so you can easily identify which type of entry you're dealing with (by checking GameID IS NOT NULL for example).
Or you could add a Type field in Entry.
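Here is a small runnable sketch of the parent-table layout using SQLite from Python. The column names beyond those mentioned above (AddingUserID, Url, Genre) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Entry (ID INTEGER PRIMARY KEY, AddDate TEXT, AddingUserID INTEGER);
CREATE TABLE Video (VideoID INTEGER PRIMARY KEY, EntryID INTEGER REFERENCES Entry(ID), Url TEXT);
CREATE TABLE Game  (GameID INTEGER PRIMARY KEY, EntryID INTEGER REFERENCES Entry(ID), Genre TEXT);
INSERT INTO Entry VALUES (1, '2021-01-02', 7), (2, '2021-01-01', 7);
INSERT INTO Video VALUES (10, 1, 'https://example.com/v');
INSERT INTO Game  VALUES (20, 2, 'puzzle');
""")

# All entries ordered by date; the uniquely named child keys tell us the type.
rows = conn.execute("""
SELECT Entry.ID,
       CASE WHEN Game.GameID IS NOT NULL THEN 'game'
            WHEN Video.VideoID IS NOT NULL THEN 'video' END AS kind
FROM Entry
LEFT JOIN Game  ON Game.EntryID  = Entry.ID
LEFT JOIN Video ON Video.EntryID = Entry.ID
ORDER BY Entry.AddDate
""").fetchall()
print(rows)   # [(2, 'game'), (1, 'video')]
```

With a Type column on Entry instead, the CASE expression goes away and you only join the one child table you actually need.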

How are viewing permissions usually implemented in a relational database?

What's the standard relational database idiom for setting permissions for items?
Answers should be general; however, they should be applicable to the example below. Anything flies: adding columns, adding another table, whatever, as long as it works well.
Application / Example
Assume the Twitter database is extremely simple: we have one User table, which contains a login and user id; we have a Tweet table, which contains a tweet id, tweet text, and creator id; and we have a Follower table, which contains the id of the person being followed and the follower.
Now, assume Twitter wants to enable advanced privacy settings (viewing permissions), so that users can pick exactly which followers can view tweets. The settings can be:
Everyone on Twitter
Only current followers (which would of course have to be approved by the user, this doesn't really matter though) EDIT: Current as in, I get a new follower, he sees it; I remove a follower, he stops seeing it.
Specific followers (e.g., user id 5, 10, 234, and 1)
Only the owner
Under these circumstances, what's the best way to represent viewing permissions? The priorities, in order, are speed of lookup (you want to be able to figure out what tweets to display to a user quickly), speed of creation (you don't want to take forever to post a tweet), and efficient use of space (every time I post a tweet to everyone on my followers' list, I shouldn't have to add a row for each and every follower I have to some table.)
Looks like a typical many-to-many relationship -- I don't see any restrictions on what you desire that would allow space savings wrt the typical relational DB idiom for those, i.e., a table with two columns (both foreign keys, one into users and one into tweets)... since the current followers can and do change all the time, posting a tweet to all the followers that are current at the instant of posting (I assume that's what you mean?) does mean adding that many (extremely short) rows to that relationship table (the alternative of keeping a timestamped history of follower sets so you can reconstruct who was a follower at any given tweet-posting time appears definitely worse in time and not substantially better in space).
If, on the other hand, you want to check followers at the time of viewing (rather than at the time of posting), then you could make a special userid artificially meaning "all followers of the current user" (just like you'll have one meaning "all users on Twitter"); the needed SQL to make the lookup fast, in that case, looks hairy but feasible (a UNION or OR with "all tweets for which I'm a follower of the author and the tweet is readable by [the artificial userid representing] all followers"). I'm not getting deep into that maze of SQL until and unless you confirm that it is this peculiar meaning that you have in mind (rather than the simple one which seems more natural to me but doesn't allow any space savings on the relationship table for the action of "post tweet to all followers").
Edit: the OP has clarified they mean the approach I mention in the second paragraph.
Then, assume userid is the primary key of the Users table, the Tweets table has a primary key tweetid and a foreign key author for the userid of each tweet's author, the Followers table is a typical many-to-many relationship table with the two columns (both foreign keys into Users) follower and followee, and the Canread table is a not-so-typical many-to-many relationship table, still with two columns -- the foreign key into Users is column reader, the foreign key into Tweets is column tweet (phew;-). Two special users #everybody and #allfollowers are defined with the above meanings (so that posting to everybody, all followers, or "just myself" each adds only one row to Canread -- only selective posting to a specific list of N people adds N rows).
So the SQL for the set of tweet IDs a user #me can read is, I think, something like:
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
WHERE Canread.reader IN (#me, #everybody)
UNION
SELECT Tweets.tweetid
FROM Tweets
JOIN Canread ON(Tweets.tweetid=Canread.tweet)
JOIN Followers ON(Tweets.author=Followers.followee)
WHERE Canread.reader=#allfollowers
AND Followers.follower=#me
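To try the query out, here is a sketch against an in-memory SQLite database from Python. The numeric IDs chosen for the two special users (-1 and -2), the sample rows and the `readable_tweets` helper are assumptions for illustration; the SQL is the same UNION as above with #me bound as a parameter.

```python
import sqlite3

EVERYBODY, ALLFOLLOWERS = -1, -2   # assumed IDs for the two special users

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tweets    (tweetid INTEGER PRIMARY KEY, author INTEGER);
CREATE TABLE Followers (follower INTEGER, followee INTEGER);
CREATE TABLE Canread   (reader INTEGER, tweet INTEGER);
-- user 100 posts: tweet 1 to everybody, 2 to followers, 3 to self only
-- user 200 posts: tweet 4 to followers
INSERT INTO Tweets VALUES (1, 100), (2, 100), (3, 100), (4, 200);
INSERT INTO Followers VALUES (5, 100);
INSERT INTO Canread VALUES (-1, 1), (-2, 2), (100, 3), (-2, 4);
""")

def readable_tweets(me):
    """Return the sorted tweet IDs that user `me` is allowed to read."""
    sql = """
    SELECT Tweets.tweetid FROM Tweets
    JOIN Canread ON Tweets.tweetid = Canread.tweet
    WHERE Canread.reader IN (:me, :everybody)
    UNION
    SELECT Tweets.tweetid FROM Tweets
    JOIN Canread ON Tweets.tweetid = Canread.tweet
    JOIN Followers ON Tweets.author = Followers.followee
    WHERE Canread.reader = :allfollowers AND Followers.follower = :me
    """
    params = {"me": me, "everybody": EVERYBODY, "allfollowers": ALLFOLLOWERS}
    return sorted(row[0] for row in conn.execute(sql, params))

print(readable_tweets(5))   # [1, 2]  follows user 100 only
print(readable_tweets(6))   # [1]     follows nobody
```

Note that every posting mode except the specific-followers list costs exactly one Canread row here, which is the space saving the question asked for.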

How to design data storage for partitioned tagging system?

How to design data storage for huge tagging system (like digg or delicious)?
There is already a discussion about it, but it is about a centralized database. Since the data is supposed to grow, we'll need to partition the data into multiple shards sooner or later. So, the question becomes: How to design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item (item_id, item_content)
Tag (tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for given tag and finding all tags for given item, if the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.
For table Item, we can partition its content with its key item_id. For table Tag, we can partition its content with its key tag_id. For example, we want to partition table Tag into K databases. We can simply choose number (tag_id % K) database to store given tag.
But, how to partition table TagMapping?
The TagMapping table represents the many-to-many relationship. I can only imagine having duplication. That is, the same TagMapping content has two copies. One is partitioned by tag_id and the other is partitioned by item_id. In the scenario of finding tags for a given item, we use the copy partitioned by item_id. In the scenario of finding items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy. And, the application level should keep the consistency of all tables. It looks hard.
Is there any better solution to solve this many-to-many partition problem?
I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!
Finding Tags for a Given Item
A1. You're going to display all of the tags for a given item at once
A2. You're going to ensure that all of an item's tags are unique
Finding Items for a Given Tag
B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)
B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags
B3. You're going to sort the items for a given tag (or tags) by some measure of popularity
Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
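A minimal sketch of that "query partitions one at a time until the page is full" loop, with each partition modelled as a plain dict from tag_id to item ids (standing in for the partition's own tag_id index). The data and the `items_for_tag` name are hypothetical.

```python
def items_for_tag(partitions, tag_id, page_size):
    """Visit partitions in order, stopping once a page of results is full."""
    page, visited = [], 0
    for partition in partitions:
        visited += 1
        # Each partition maps tag_id -> item ids, standing in for its local index.
        page.extend(partition.get(tag_id, []))
        if len(page) >= page_size:
            break
    return page[:page_size], visited

shards = [
    {7: [101, 102], 9: [103]},
    {7: [201]},
    {7: [301, 302]},
]
print(items_for_tag(shards, 7, 3))   # ([101, 102, 201], 2)
```

For a popular tag the loop stops after a partition or two; for a rare tag it degrades to visiting every partition, which is the cost of partitioning TagMapping by item.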
The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
The rule is that you partition by the field that you are going to query by. Otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not, you'll also need to query by tag title. It's not so obvious for the Item table, but you would probably also like to query by something like URL to find the item_id for it when another user assigns tags to it.
But note that the Tag and Item tables have an immutable title and URL. That means you can use the following technique:
Choose partition from title (for Tag) or URL (for Item).
Choose sequence for this partition to generate id.
You either use partition-localID pair as global identifier or use non-overlapping number sets. Anyway, now you can compute partition from both id and title/URL fields. Don't know number of partitions in advance or worrying it might change in future? Create more of them and join in groups, so that you can regroup them in future.
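One way to sketch the "non-overlapping number sets" variant: pick the partition from a stable hash of the title, take the next value from that partition's local sequence, and combine them so the partition can be recomputed from either the id or the title. The partition count `K`, the use of CRC32 as the hash and the helper names are all assumptions for illustration.

```python
import zlib
from itertools import count

K = 4                                        # assumed number of partitions
_sequences = [count(1) for _ in range(K)]    # one local sequence per partition

def partition_of_title(title):
    """Choose the partition from the (immutable) title with a stable hash."""
    return zlib.crc32(title.encode("utf-8")) % K

def new_tag_id(title):
    """Generate a global id from the partition's local sequence."""
    p = partition_of_title(title)
    local_id = next(_sequences[p])
    # Ids in partition p are always congruent to p modulo K, so sets never overlap.
    return local_id * K + p

def partition_of_id(tag_id):
    """Recover the partition from the id alone."""
    return tag_id % K

titles = ["python", "databases", "sharding", "dynamodb", "cassandra"]
tag_ids = {title: new_tag_id(title) for title in titles}
print(all(partition_of_id(tag_ids[t]) == partition_of_title(t) for t in titles))   # True
```

This only holds while K stays fixed, which is why creating more partitions than you need up front and grouping them onto servers is the safer way to leave room for growth.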
Sure, you can't do the same for TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. So the difference is that you use different partitioning (by different field) for each index. I see no reason to worry about.
Most likely your queries are going to be related to a user or a topic. Meaning that you should have all info related to those in one place.
You're talking about distributing the DB; usually this is mostly an issue of synchronization. Reading, which is usually about 90% of the work, can be done on a replicated database. The issue is how to update one DB and remain consistent with all the others without killing performance. This depends on the details of your scenario.
The other possibility is to partition, like you asked, all the data without overlapping. You probably would partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just telling which dedicated DB is holding the data. You can then query the correct one. Since you partition by ID, all info related to that topic could be on that specialized database. You could partition also by language or country for an international website.
Last but not least, you'll probably end up mixing the two: Some non-overlapping data, and some overlapping (replicated) data. First find usual operations, then find how to make those on one DB in least possible queries.
PS: Don't forget about caching, it'll save you more than distributed-DB.
