Adapting a one-to-many relationship to DynamoDB (NoSQL)

Introduction
Hello, I'm moving to AWS for stability, performance, etc. I'll be using DynamoDB because its always-free tier lets me cut my bill significantly. Until now I was using MySQL. I've simplified the attributes in this example to focus on the parts where I need help and keep the question short.
My actual DB has less than 5k rows and I expect it to grow to 20-30k in 2 years. Each user (without any group/order data) is around 600B. I don't know how this will translate to a NoSQL DB but I expect it to be less than 10MB.
What data will I have?
User:
username
password
is_group_member
Group:
capacity
access_level
Order:
oid
status
prod_id
Relationships:
User has many orders.
Group has many users.
How will I access the data and what will I get?
I will access a user by username (without knowing which group they are in). I will need to get the user's data, the group they belong to, and that group's data.
I will access the users that belong to a certain group. I will need to get those users' data and the group's data.
I will access an order by its oid. I will need to get the order's data and the user it belongs to.
What I tried
I watched a series of videos by Gary Jennings, read answers on SO and also read alexdebrie's article about one-to-many relationships. My problem is that I can't seem to find an alternative that suits all the ways I will access the data.
For example:
Denormalization: it will leave me with a lot of duplicated data, thus increasing the cost.
Composite primary key: I will be able to access users by their group, but how will I access the user's and the group's data without knowing the group beforehand? I would need two requests, making it inefficient and increasing the costs.
Secondary index + the Query API action: again, I would need two requests, making it inefficient and increasing the costs.
Final questions
Did I misunderstand the alternatives? I'm asking because my knowledge isn't deep enough to know whether there's a better alternative I can't think of, so maybe I got the explanations wrong.
Is there a better alternative for this case?
If there isn't a better alternative, what would you do in my case? Would you duplicate the group's data (increasing the storage used but needing only one request), or would you use one of the other two alternatives and make two requests?

You're off to a great start by articulating your access patterns ahead of time.
Let's start by addressing some of your comments about data modeling in DynamoDB:
Denormalization: it will leave me with a lot of duplicated data, thus increasing the cost.
When first learning DynamoDB data modeling, prior SQL Database knowledge can really get in the way. Normalizing your data is a common practice when working with SQL databases. However, denormalizing your data is a key data modeling strategy in DynamoDB.
One BIG reason you want to denormalize your data: DynamoDB doesn't have joins. Because DDB doesn't have joins, you'll be well served to pre-join your data so it can be fetched in a single query.
This blog post does a good job of explaining why denormalization is important in DDB.
Keep in mind, storage is cheap. Denormalizing your data makes for faster data access at a relatively low cost. With the size of your database, you will likely be well under the free-tier threshold. Don't stress about the duplicate data!
Composite primary key: I will be able to access users by their group, but how will I access the user's and the group's data without knowing the group beforehand? I would need two requests, making it inefficient and increasing the costs.
Denormalizing your data will help solve this problem (e.g. store the group info with the user). I'll give you an example of this below.
Secondary index + the Query API action: again, I would need two requests, making it inefficient and increasing the costs.
You didn't share your primary key structure, so I'm not sure what scenario will require two requests. However, I will say that there may be certain situations where making two requests to DDB is a reasonable approach. Making two efficient query operations is not the end of the world.
OK, on to an example of modeling your relationships! Keep in mind that there are many ways to model data in DynamoDB. This example is not THE way. Rather, it's an example meant to demonstrate a few strategies that might help.
Here's one take on your data model:
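As a rough sketch only (the table name AppTable, the generic PK/SK attribute names, and the sample values alice, G1, and 1001 are my assumptions, not part of your schema), seeding the base table with boto3 might look like this:

import boto3

# Sketch only: table name and all sample values are assumptions
table = boto3.resource("dynamodb").Table("AppTable")

with table.batch_writer() as batch:
    # The user item itself
    batch.put_item(Item={
        "PK": "USER#alice", "SK": "USER#alice",
        "username": "alice", "password": "<hash>", "is_group_member": True,
    })
    # The user's group membership, with the group data denormalized onto it
    # (and enough user data repeated to serve the "fetch group users" pattern)
    batch.put_item(Item={
        "PK": "USER#alice", "SK": "GROUP#G1",
        "capacity": 10, "access_level": "member", "username": "alice",
    })
    # One item per order
    batch.put_item(Item={
        "PK": "USER#alice", "SK": "ORDER#1001",
        "oid": "1001", "status": "shipped", "prod_id": "P42",
    })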
With this arrangement, you can support the following access patterns (see the boto3 sketch after the list):
Fetch user information - PK = USER#[username] SK = USER#[username]
Fetch user group - PK = USER#[username] SK begins_with GROUP#. Notice I denormalized user data in the group item. The reason for this will be apparent shortly :)
Fetch user orders - PK = USER#[username] SK begins_with ORDER#
Fetch all user data - PK = USER#[username]
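A minimal boto3 sketch of those four queries, under the same assumptions as above (table name AppTable, user alice):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # assumed table name

# Fetch user information
user = table.get_item(Key={"PK": "USER#alice", "SK": "USER#alice"})

# Fetch the user's group (the membership item also carries the group data)
group = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice") & Key("SK").begins_with("GROUP#")
)

# Fetch the user's orders
orders = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice") & Key("SK").begins_with("ORDER#")
)

# Fetch all user data (user item + group membership + orders) in one request
everything = table.query(KeyConditionExpression=Key("PK").eq("USER#alice"))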
To support your remaining access patterns, I created a secondary index. The secondary index's partition key and sort key are the base table's sort key and partition key, swapped. This pattern is called an inverted index. The secondary index looks like this:
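For instance (the index name GSI1 and the table definition below are assumptions, shown only to make the inverted-index idea concrete), the table and its inverted index could be created with boto3 like this:

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="AppTable",                  # assumed table name
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "GSI1",               # the inverted index
        "KeySchema": [
            {"AttributeName": "SK", "KeyType": "HASH"},   # base sort key becomes the partition key
            {"AttributeName": "PK", "KeyType": "RANGE"},  # base partition key becomes the sort key
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)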
This secondary index supports the following access patterns:
Fetch Group users - PK = GROUP#[groupid]
Fetch Order by oid - PK = ORDER#[oid]
You can see that I denormalized the User and Group relationship by repeating user data in the item representing the Group. This helps me with the "fetch group users" access pattern.
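Querying the inverted index for those two patterns might look roughly like this (again, the table name, index name, and sample values are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")

# Fetch a group's users: each membership item carries the repeated user data
group_users = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("SK").eq("GROUP#G1"),
)

# Fetch an order by oid: the returned item's PK tells you which user it belongs to
order = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("SK").eq("ORDER#1001"),
)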
Again, this is just one way you can achieve the access patterns you described. There are many strategies, but many will require that you abandon some of the best practices you learned working with SQL databases!

Related

How can I store an indefinite amount of stuff in a field of my database table?

Here's a simple version of the website I'm designing: users can belong to one or more groups, as many groups as they want. When they log in they are presented with the groups they belong to. Ideally, in my Users table I'd like an array or something unbounded to which I can keep adding the IDs of the groups that user joins.
Additionally, although I realize this isn't necessary, I might want a column in my Group table holding an indefinite number of user IDs belonging to that group. (Side question: would that be more efficient than getting all the users of the group by querying the user table for users belonging to a certain group ID?)
Does my question make sense? Mainly I want to be able to fill a column with an indefinite list of IDs... The only way I can think of is making it some super long varchar with the list JSON-encoded in there, but ewww.
Please and thanks
Oh, and it's a MySQL database (my website is in PHP), but after 2 years of PHP development I've recently decided PHP sucks and I hate it, and ASP.NET web applications are the only way for me, so I guess I'll be implementing this on whatever kind of database I'll need for that.
Your intuition is correct; you don't want to have one column of unbounded length just to hold the user's groups. Instead, create a table such as user_group_membership with the columns:
user_id
group_id
A single user_id could have multiple rows, each with the same user_id but a different group_id. You would represent membership in multiple groups by adding multiple rows to this table.
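A minimal sketch of this junction table, using SQLite from Python purely for illustration (the table and column names follow the answer; the sample IDs and names are made up); the same shape applies in MySQL:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE groups (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_group_membership (
        user_id  INTEGER NOT NULL REFERENCES users(id),
        group_id INTEGER NOT NULL REFERENCES groups(id),
        PRIMARY KEY (user_id, group_id)
    );
""")

con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice")])
con.executemany("INSERT INTO groups VALUES (?, ?)", [(10, "admins"), (20, "editors")])

# User 1 joins two groups: one membership row per group, no unbounded column needed
con.executemany("INSERT INTO user_group_membership VALUES (?, ?)", [(1, 10), (1, 20)])

# All groups a user belongs to
rows = con.execute("""
    SELECT g.name
    FROM groups g
    JOIN user_group_membership m ON m.group_id = g.id
    WHERE m.user_id = ?
""", (1,)).fetchall()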
What you have here is a many-to-many relationship. A "many-to-many" relationship is represented by a third, joining table that contains both primary keys of the related entities. You might also hear this called a bridge table, a junction table, or an associative entity.
You have the following relationships:
A User belongs to many Groups
A Group can have many Users
In database design, this is modeled with a separate UserGroup (junction) table holding a user_id and a group_id, with a one-to-many relationship from User to UserGroup and another from Group to UserGroup.
This way, a UserGroup row represents any combination of a User and a Group without the problem of having "infinite columns."
If you store an indefinite amount of data in one field, your design does not conform to First Normal Form (1NF). 1NF is the first step in a design discipline called data normalization. Data normalization is a major aspect of database design. Normalized design is usually good design, although there are some situations where a different design pattern is better adapted.
If your data is not in 1NF, you will end up doing sequential scans for some queries where a normalized database could be accessed via a quick lookup. For a table with a billion rows, this could mean a delay of an hour rather than a few seconds. 1NF guarantees a direct-access lookup path for each item of data.
As other responders have indicated, such a design will involve more than one table, to be joined at retrieval time. Joining takes some time, but it's tiny compared to the time wasted in sequential scans, if the data volume is large.

Database Design to handle newsfeed for different activities

I am going to create a new project where I need users to view their friends' activities and actions, just like Facebook and LinkedIn.
Each user is allowed to do 5 different types of activities; each activity has different attributes. For example, activity X can be public/private, while activity Y is assigned to categories. Some actions involve 1 user, others 2 or 3, etc. Eventually I have to aggregate all 5 types of activities on the news feed page.
How can I design a database that is efficient?
I have 3 designs in mind, please let me know your thoughts. Any new ideas will be greatly appreciated!
1- Separate tables: since there are nearly 3-4 different columns for each activity, it would be logical to separate each activity into its own table.
Pros: Clean database, and easy to develop.
Cons: I will need to query the database 5 times and aggregate the results to build a single newsfeed page.
2- One big table: this table will hold all activities, with many unused columns. A new numeric column called "type" will indicate the type of activity. Some attributes could be combined into an HStore field (since we are using Postgres); others will be queried a lot, so I don't think it is a good idea to put them in an HStore field.
Pros: Easy to pull the newsfeed.
Cons: Lots of reads/writes on the same table, and the code will be a bit messier, as will the database.
3- Hybrid: make one table containing all the newsfeed entries, with a polymorphic association to other tables that contain the details of each specific activity.
Pros: Tidy code and database, easy to add new activities.
Cons: JOIN ALL THE TABLES to make a single newsfeed! Still better than making 5 different queries.
As I am writing this post I am starting to lean towards solution number 2. Please advise!
Thanks
I would consider a graph database for this, such as Neo4j. It allows very flexible attributes on either nodes (users) or relationships (types of activities).
For small data sets and few joins, SQL databases are faster and more appropriate. But if your starting point is a 5-table join, graph databases seem simpler and offer similar performance (if not better).
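A rough sketch with the official Neo4j Python driver (the connection details, labels, relationship types, and property names are all assumptions) of how each activity can live on a relationship, so activity types with different attributes don't need different tables:

from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Record an activity: the relationship itself carries the per-activity attributes
    session.run(
        "MERGE (u:User {id: $uid}) "
        "MERGE (t:Item {id: $tid}) "
        "MERGE (u)-[:DID {type: $type, visibility: $vis, at: timestamp()}]->(t)",
        uid=1, tid=42, type="share", vis="public",
    )

    # Newsfeed: recent activities performed by the user's friends
    feed = session.run(
        "MATCH (:User {id: $uid})-[:FRIEND]->(f:User)-[a:DID]->(t) "
        "RETURN f.id AS friend, a.type AS activity, t.id AS item "
        "ORDER BY a.at DESC LIMIT 20",
        uid=1,
    ).data()

driver.close()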

Are there situations where it makes sense to store redundant data in a database, to increase speed?

In the following situation:
user has many bags
bag has many items
users
-id
bags
-id
-user_id
items
-id
-bag_id
There are 2 ways to get access to a user's items.
1) An instance method can be added to a user, which iterates over each of the user's bags, and collects the bag's items into an array to return. In Ruby on Rails, something like:
#in user.rb
def items
  items = []
  bags.includes(:items).each { |bag| items += bag.items }
  items # return the collected array (each on its own would return the bags)
end
2) Add a user_id attribute directly to the items table, and add an additional relationship, so that user has many items. Then just do user.items.
The 2nd method would be faster, but involves storing redundant data. Are there situations where it makes sense to implement it?
Yes, there are certain circumstances where it does make sense to introduce some controlled redundancy for the sake of performance. Generally, this should only be done when a database is unable to meet its performance requirements. This is called "denormalisation" and what you have to consider is that it:
Can make implementation more complex
Often sacrifices flexibility 
May speed up retrievals but slow down updates (as you now have to update more than one location)
So it's something to consider in cases where performance is unsatisfactory and a relation has a low update rate and high query rate.
There are also some denormalised database designs, such as the star schema, which are used in database warehousing.
Yes. In particular, reporting databases, data marts and data warehouses often use design principles that knowingly depart from some of the normalization rules. The result is a database that has some redundancy in it, but is not only faster to query, but easier as well.
Ease of query is particularly important when there is an analytic GUI between the database and the database user. These analytic tools are quite a bit easier to master if certain design principles are followed in database design. Normalization isn't particularly helpful in this regard.
Unnormalized design need not mean undisciplined design. In particular, it's worth boning up on star schema and snowflake schema designs if you plan on building a reporting database, a data mart, or a data warehouse. The process by which a star or snowflake schema is kept up to date, sometimes called ETL (extract-transform-load), has to be carefully written so as to prevent controlled redundancy from resulting in self-contradictory data.
In transaction oriented databases, normalized is generally better, although many experts don't try to push it beyond Boyce-Codd normal form.
Unless there's something you haven't told us, a table like this
create table bagged_items (
user_id integer not null,
bag_id integer not null,
item_id integer not null,
primary key (user_id, bag_id, item_id)
);
is in at least 5NF. It's all key. There's not a bit of redundant data there.
What you've done isn't normalization; normalization is based on identifying certain kinds of dependencies, and reducing their effects by projection. And what you've done isn't denormalization, either; denormalization is an undoing of normalization.
You've simply split a primary key into pieces. I don't pretend to know what principle you followed in order to justify that. It looks a little like "no table may have more than one foreign key" normal form. (But, of course, there's no such thing.)
For combining records from two SQL tables, databases implement efficient JOIN methods that can be used from Ruby on Rails. For almost all applications this is fast enough. That being said, for certain high-performance stores you might want to store redundant data as you suggested, but this comes at the cost of having to keep the data in sync on writes.

Property address database design in DynamoDB NoSQL

We have several terabytes of address data and are investigating the possibility of storing this in a DynamoDB NoSQL database. I've done quite a bit of reading on DynamoDB and NoSQL in general, but am coming from many years of MS SQL and am struggling with some of the NoSQL concepts.
My biggest question at this point is how to setup the table structure so that I can accommodate the various different ways the data could be queried. For example, in regular SQL I would expect some queries like:
WHERE Address LIKE '%maple st%' AND ZipCode = 12345
WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
Those are just examples. The actual queries could be any combination of those various fields.
It's not clear to me what my index should look like or how the table (or tables) should be structured. Can anyone get me started on understanding how that could work?
The post is relatively old, but I will try to give you an answer (maybe it will be helpful for someone having similar issues in the future).
DynamoDB is not really meant to be used in the way you describe. Its strengths are in fast (smoking fast, in fact) look-ups of key/value pairs. To take your example: if you wanted to look up the information associated with an address really quickly, you could make the hash key a string containing that address and use it to do the look-up.
Things start to get complicated when you want to do queries (or scans) in DynamoDB; you can read about them here: Query and Scan in DynamoDB.
The gist is that scans/queries are really expensive when not performed on either the hash key or the hash key + range key combo (together they form a composite key).
In other words, I am not sure DynamoDB is the right way to go. For smoking-fast search functionality I would consider using something like Lucene. If you configure your indexes wisely you will be amazed how fast it works.
Hope this helps.
Edit:
Seems Amazon has now added support for secondary indices:
See here
DynamoDB was built to be utilized in the way the question author describes. Refer to this LINK, where the AWS documentation describes creating a secondary index like this:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
The partition key could be shaped something like this as well, based on what you want to look up.
In DynamoDB, you create the joins before you create the table. This means that you have to think about all the ways you intend to search for your data, create the indexes, and query your data using them.
AWS created NoSQL Workbench to help teams do this. There are a few UI bugs in that application at the time of this writing; refer to LINK for more information on the bugs.
To review some of the queries you mentioned, I'll share a few ways you could shape an index to support each one.
Note: NoSQL often means denormalized data, but not necessarily.
There are limits on how keys should be shaped so that DynamoDB can partition actual servers to scale; refer to the partition key documentation for more info.
The magic of DynamoDB is a well-thought-out model that can also handle new queries after the table is created and in use in production. There are a great many posts and videos online that explain how to do this.
Here is one with Rick Houlihan (link). Rick Houlihan is the principal designer of DynamoDB, so go there for gospel.
To make the queries you're attempting, one would create multiple keys, mainly a partition key and a sort key. Rick recommends keeping the names generic, like PK and SK.
Then try to shape the PK with a great deal of uniqueness. For example, a partition key of just a zip code (PK: "12345") could hold a massive amount of data, possibly more than the 10 GB per-partition-key limit.
Example 1: WHERE Address LIKE '%maple st%' AND ZipCode = 12345
For example 1, we could shape a partition key of PK: "12345:maple"
Then simply querying the PK "12345:maple" would retrieve all the data with that zip code and a street of maple. There will be many different PKs, and that is what DynamoDB does well: scale horizontally.
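For instance (the table name Addresses and the generic PK attribute are assumptions), fetching everything for that zip/street combination is a single boto3 query:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Addresses")  # assumed table name

# Every item stored under the zip+street partition key
result = table.query(KeyConditionExpression=Key("PK").eq("12345:maple"))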
Example 2: WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
In example 2, we could then use the secondary (sort) key to add another way to be more specific, such as PK: "12345:poplar", SK: "losangeles:ca:other:info:that:helps".
Example 3: WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
For example 3, we don't have a street name. We would need to know the street name to query the data, but we may not have it in a search. This is where one needs to fully understand their base query patterns and shape the PK so it is easily known at query time while still being unique enough not to exceed the partition limits. Having a street name in the key would probably not be optimal; it all depends on what queries are required.
In this last example, it may be more appropriate to add some global secondary indexes, which just means defining a new partition key and sort key that map to data attributes (columns) like CountyFIPS.
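As a sketch (the index name ByCounty, the table name, and the attribute names are assumptions), such a GSI could then be queried like this. Note that a key condition supports begins_with but not a true contains-style '%smith%' match, which is where a dedicated search engine, as suggested above, still helps:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Addresses")  # assumed table name

# Assumed GSI "ByCounty": partition key CountyFIPS, sort key OwnerName
owners = table.query(
    IndexName="ByCounty",
    KeyConditionExpression=Key("CountyFIPS").eq("00239") & Key("OwnerName").begins_with("smith"),
)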

Database design for a product aggregator

I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to be able to search and filter results based on any of those product attributes. I also expect to have a large number of users. My initial thought was one big table with every product in it, a column for each piece of information, and an index on anything I need to search by, but I think this might be inefficient with a lot of users pounding on that one table. My other thought was to organize the database to promote tree-like navigation of tables, but because you can search by anything, I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) will be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system; it depends on what your data flow/workflow/transactions are. Or consider adopting some of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
I don't see any need for the tree structure here. You can do this with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text-based search, and ease of startup and design, I strongly recommend Apache Solr. The Solr API is easy to use (especially with JSON). Databases do text search poorly; I would instead recommend that you just make sure they respond to primary/unique key queries properly, and those are the fields you should index.
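A minimal sketch of hitting Solr's JSON select endpoint from Python (the host, the core name products, and the field name name are assumptions about your setup):

import json
import urllib.parse
import urllib.request

# Assumed local Solr instance with a core named "products"
params = urllib.parse.urlencode({"q": "name:shoes", "wt": "json", "rows": 10})
url = f"http://localhost:8983/solr/products/select?{params}"

with urllib.request.urlopen(url) as resp:
    docs = json.load(resp)["response"]["docs"]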
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID  PARENT_ID  NAME
1   NULL       ROOT
2   1          SOCKS
3   1          HELICOPTER PARTS
4   2          ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.
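A sketch of such a recursive query, using SQLite's WITH RECURSIVE from Python purely for illustration (Oracle's CONNECT BY or a standard recursive CTE works along the same lines); the table and sample rows mirror the example above:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product_category (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO product_category VALUES
        (1, NULL, 'ROOT'),
        (2, 1,    'SOCKS'),
        (3, 1,    'HELICOPTER PARTS'),
        (4, 2,    'ARGYLE');
""")

# Walk the whole category tree from the root in a single recursive query
rows = con.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM product_category WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, t.depth + 1
        FROM product_category c
        JOIN tree t ON c.parent_id = t.id
    )
    SELECT id, name, depth FROM tree ORDER BY depth
""").fetchall()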
