We have several terabytes of address data and are investigating the possibility of storing this in a DynamoDB NoSQL database. I've done quite a bit of reading on DynamoDB and NoSQL in general, but am coming from many years of MS SQL and am struggling with some of the NoSQL concepts.
My biggest question at this point is how to set up the table structure so that I can accommodate the various ways the data could be queried. For example, in regular SQL I would expect some queries like:
WHERE Address LIKE '%maple st%' AND ZipCode = 12345
WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
Those are just examples. The actual queries could be any combination of those various fields.
It's not clear to me what my index should look like or how the table (or tables) should be structured. Can anyone get me started on understanding how that could work?
The post is relatively old, but I will try to give you an answer (maybe it will be helpful for someone having similar issues in the future).
DynamoDB is not really meant to be used in the way you describe. Its strengths are in fast (smoking fast, in fact) look-ups of key/value pairs. To take your example of an IP address: if you wanted to look up information associated with an IP address really quickly, you could make the HashKey a string containing the IP address and use it for the look-up.
Things start to get complicated when you want to do queries (or scans) in DynamoDB; you can read about them here: Query and Scan in DynamoDB
The gist is that scans/queries are really expensive when not performed on either the HashKey or the HashKey+RangeKey combo (range keys are basically composite keys).
In other words, I am not sure DynamoDB is the right way to go. For smoking fast search functionality I would consider using something like Lucene. If you configure your indexes wisely you will be amazed how fast it works.
Hope this helps.
Edit:
Seems Amazon has now added support for secondary indices:
See here
DynamoDB was built to be used in the way the question author describes; refer to this LINK, where the AWS documentation describes creating a secondary index like this:
[country]#[region]#[state]#[county]#[city]#[neighborhood]
The partition key could be something like this as well based on what you want to look up.
In DynamoDB, you create the joins before you create the table. This means that you have to think about all the ways you intend to search for your data, create the indexes, and query your data using them.
AWS created NoSQL Workbench to help teams do this. There are a few UI bugs in that application at the time of this writing; refer to LINK for more information on the bugs.
To review some of the queries you mentioned, I'll share a few possibilities in which you can create an index to create that query.
Note: noSQL means denormalized data in some cases, but not necessarily.
There are limits on how keys should be shaped so that DynamoDB can partition data across servers to scale; refer to partition keys for more info.
The magic of DynamoDB is a well-thought-out model that can also handle new queries after the table is created and in use in production. There are a great many posts and videos online that explain how to do this.
Here is one with Rick Houlihan: link. Rick Houlihan is the principal designer of DynamoDB, so go there for gospel.
To make the queries you're attempting, one would create multiple keys, mainly a partition key and a sort key. Rick recommends keeping their names generic, like PK and SK.
Then try to shape the PK so that it has a great deal of uniqueness. For example, a partition key that is only a zip code, PK: "12345", could hold a massive amount of data, possibly more than the 10 GB quota for a single partition key value.
Example 1: WHERE Address LIKE '%maple st%' AND ZipCode = 12345
For example 1, we could shape a partition key of PK: "12345:maple"
Then just querying the PK "12345:maple" would retrieve all the data with that zip code and a street name of maple. There will be many different PKs, and that is what DynamoDB does well: it scales horizontally.
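A minimal Python sketch of how such a composite key could be built and used. The helper name make_pk, the lower-casing rule, and the sample records are assumptions made for illustration; the table is modeled as a plain dict rather than a real DynamoDB call.

```python
from collections import defaultdict


def make_pk(zip_code: str, street_name: str) -> str:
    """Build a composite partition key such as '12345:maple' (hypothetical format)."""
    return f"{zip_code.strip()}:{street_name.strip().lower()}"


# Sample address records (invented for illustration).
records = [
    {"zip": "12345", "street": "Maple", "address": "10 Maple St"},
    {"zip": "12345", "street": "Maple", "address": "22 Maple St"},
    {"zip": "12345", "street": "Poplar", "address": "5 Poplar Ln"},
]

# Stand-in for the table: items grouped under their partition key.
table = defaultdict(list)
for rec in records:
    table[make_pk(rec["zip"], rec["street"])].append(rec)

# One key lookup returns every item for that zip code and street.
maple_items = table[make_pk("12345", "maple")]
print([item["address"] for item in maple_items])
```

The key has to be computable from what the caller knows at query time, which is why the street name is normalized the same way on write and on read.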
Example 2: WHERE Address LIKE '%poplar ln%' AND City = 'Los Angeles' AND State = 'CA'
In example 2, we could then use the sort key (SK) as another way to be more specific, such as PK: "12345:poplar" SK: "losangeles:ca:other:info:that:helps"
Example 3: WHERE OwnerName LIKE '%smith%' AND CountyFIPS = '00239'
For example 3, we don't have a street name. We would need to know the street name to query the data, but we may not have it in a search. This is where one needs to fully understand the base query patterns and shape the PK so that it is easily known at the time of the query while still being unique enough that we do not go over the partition limits. Having a street name in the key would probably not be optimal here; it all depends on what queries are required.
In this last example, it may be more appropriate to add some global secondary indices, which just means defining new partition and sort keys that map to a data attribute (column) like CountyFIPS.
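The idea of re-keying the same items by another attribute can be sketched in plain Python. This is only an in-memory model, not DynamoDB's API: the function names build_index and owners_in_county and all sample data are assumptions. It also shows that the owner-name substring match is applied as a filter after the key lookup, not as part of it.

```python
from collections import defaultdict

# Base table items; all values are invented sample data.
items = [
    {"PK": "12345:maple", "SK": "10", "OwnerName": "John Smith", "CountyFIPS": "00239"},
    {"PK": "12345:poplar", "SK": "5", "OwnerName": "Ann Lee", "CountyFIPS": "00239"},
    {"PK": "54321:oak", "SK": "7", "OwnerName": "Jane Smithson", "CountyFIPS": "00111"},
]


def build_index(items, key_attribute):
    """Group items by one attribute, the way a secondary index re-keys the same data."""
    index = defaultdict(list)
    for item in items:
        index[item[key_attribute]].append(item)
    return index


county_index = build_index(items, "CountyFIPS")


def owners_in_county(index, county_fips, name_fragment):
    """Look up one index partition, then filter by substring inside it."""
    candidates = index.get(county_fips, [])
    return [i for i in candidates if name_fragment.lower() in i["OwnerName"].lower()]


matches = owners_in_county(county_index, "00239", "smith")
print([m["OwnerName"] for m in matches])
```

Only the items under county 00239 are examined, so the cost of the substring filter depends on how many items share that key value.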
Introduction
Hello, I'm moving to AWS because of stability, performance, etc. I will be using DynamoDB because of the always free tier that allows me to reduce my bills a lot. I was using MySQL until now. I will make the attributes simple for this example (to show the actual places where I need help and make the question shorter).
My actual DB has less than 5k rows and I expect it to grow to 20-30k in 2 years. Each user (without any group/order data) is around 600B. I don't know how this will translate to a NoSQL DB but I expect it to be less than 10MB.
What data will I have?
User:
username
password
is_group_member
Group:
capacity
access_level
Order:
oid
status
prod_id
Relationships:
User has many orders.
Group has many users.
How will I access the data and what will I get?
I will access the user by username (I won't know the group he is in). I will need to get the user's data, the group he belongs to and its data.
I will access the users that belong to a certain group. I will need to get the users' data and the group data.
I will access an order by its oid. I will need to get the user it belongs to and its data.
What I tried
I watched a series of videos by Gary Jennings, read answers on SO and also read alexdebrie's article about one-to-many relationships. My problem is that I can't seem to find an alternative that suits all the ways I will access the data.
For example:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
Composite primary key: I will be able to access the users by its group but how will I access the user and the group's data without knowing the group beforehand. I would need to use 2 requests making it inefficient and increasing the costs.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
Final questions
Did I misunderstand the alternatives? I started this question because my knowledge is not "big enough" to actually know if there is a better alternative that I can't think of, so maybe I got the explanations wrong.
Is there a better alternative for this case?
If there wasn't a better alternative, what would you do in my case? Would you duplicate the group's data (thus increasing the used space and making it need only 1 request)? or would you use one of the other 2 alternatives and use 2 requests?
You're off to a great start by articulating your access patterns ahead of time.
Let's start by addressing some of your comments about data modeling in DynamoDB:
Denormalization: it will leave me with a lot of duplicated data thus increasing the cost.
When first learning DynamoDB data modeling, prior SQL Database knowledge can really get in the way. Normalizing your data is a common practice when working with SQL databases. However, denormalizing your data is a key data modeling strategy in DynamoDB.
One BIG reason you want to denormalize your data: DynamoDB doesn't have joins. Because DDB doesn't have joins, you'll be well served to pre-join your data so it can be fetched in a single query.
This blog post does a good job of explaining why denormalization is important in DDB.
Keep in mind, storage is cheap. Denormalizing your data makes for faster data access at a relatively low cost. With the size of your database, you will likely be well under the free tier threshold. Don't stress about the duplicate data!
Composite primary key: I will be able to access the users by its group but how will I access the user and the group's data without knowing the group beforehand. I would need to use 2 requests making it inefficient and increasing the costs.
Denormalizing your data will help solve this problem (e.g. store the group info with the user). I'll give you an example of this below.
Secondary index + the Query API action: Again I would need to use 2 requests making it inefficient and increasing the costs.
You didn't share your primary key structure, so I'm not sure what scenario will require two requests. However, I will say that there may be certain situations where making two requests to DDB is a reasonable approach. Making two efficient query operations is not the end of the world.
OK, on to an example of modeling your relationships! Keep in mind that there are many ways to model data in DynamoDB. This example is not THE way. Rather, it's an example meant to demonstrate a few strategies that might help.
Here's one take of your data model:
With this arrangement, you can support the following access patterns:
Fetch user information - PK = USER#[username] SK = USER#[username]
Fetch user group - PK = USER#[username] SK begins_with GROUP#. Notice I denormalized user data in the group item. The reason for this will be apparent shortly :)
Fetch user orders - PK = USER#[username] SK begins_with ORDER#
Fetch all user data - PK = USER#[username]
To support your remaining access patterns, I created a secondary index. The partition key and sort key of the secondary index are swapped relative to the base table: the index is partitioned on the base table's SK and sorted on its PK. This pattern is called an inverted index. The secondary index looks like this:
This secondary index supports the following access patterns:
Fetch Group users - PK = GROUP#[groupId]
Fetch Order by oid - PK = ORDER#[oid]
You can see that I denormalized the User and Group relationship by repeating user data in the item representing the Group. This helps me with the "fetch group users" access pattern.
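The access patterns above can be sketched with a small in-memory model. This is an illustration under assumptions, not the DynamoDB API: the query helper, the attribute values, and the user and group names are invented, and the inverted index is modeled by querying the same items with SK as the partition attribute.

```python
# In-memory stand-in for the single table; item contents are invented.
items = [
    {"PK": "USER#alice", "SK": "USER#alice", "is_group_member": True},
    {"PK": "USER#alice", "SK": "GROUP#admins", "access_level": 3},
    {"PK": "USER#alice", "SK": "ORDER#1001", "status": "shipped"},
    {"PK": "USER#bob", "SK": "USER#bob", "is_group_member": True},
    {"PK": "USER#bob", "SK": "GROUP#admins", "access_level": 3},
]


def query(items, pk_attr, pk_value, sk_attr=None, sk_prefix=""):
    """Return items whose partition attribute matches and whose sort attribute starts with sk_prefix."""
    return [
        item for item in items
        if item[pk_attr] == pk_value
        and (sk_attr is None or item[sk_attr].startswith(sk_prefix))
    ]


# Base table: partition on PK, sort on SK.
alice_all = query(items, "PK", "USER#alice")
alice_orders = query(items, "PK", "USER#alice", "SK", "ORDER#")

# Inverted index: partition on SK, sort on PK.
admins = query(items, "SK", "GROUP#admins")
order_owner = query(items, "SK", "ORDER#1001")

print([i["SK"] for i in alice_orders])
print([i["PK"] for i in admins])
print(order_owner[0]["PK"])
```

Each pattern is a single lookup on one partition value, which is the point of storing the group item under every user.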
Again, this is just one way you can achieve the access patterns you described. There are many strategies, but many will require that you abandon some of the best practices you learned working with SQL databases!
I'm trying to find the best data model to adapt a very big mysql table in Cassandra.
This table is structured like this:
CREATE TABLE big_table (
social_id,
remote_id,
timestamp,
visibility,
type,
title,
description,
other_field,
other_field,
...
)
A page (which is not here) can contain many socials, which can contain many remote_ids.
Social_id is the partition key; remote_id and timestamp are the clustering key: remote_id gives uniqueness, and timestamp is used to order the results. So far so good.
The problem is that users can also search on their page contents, filtering by one or more socials, one or more types, visibility (could be 0,1,2), a range of dates or even nothing at all.
Plus, based on the filters, users should be able to set visibility.
I tried to handle this case, but I really can't find a sustainable solution.
The best I've got is to create another table, which I need to keep up with the original one.
This table will have:
page_id: partition key
timestamp, social_id, type, remote_id: clustering key
Plus, create a Materialized View for each combination of filters, which is madness.
Can I avoid creating the second table? What would be the best Cassandra model in this case? Should I consider switching to other technologies?
I'll start with the last questions.
> What would be the best Cassandra model in this case?
As stated in Cassandra: The Definitive Guide, 2nd edition (which I highly recommend to read before choosing or using Cassandra),
In Cassandra you don’t start with the data model; you start with the query model.
You may want to read an available chapter about data design at Safaribooksonline.com. Basically, Cassandra wants you to think about queries only and not to care about normalization.
So the answer to
> Can I avoid creating the second table?
is: you shouldn't avoid it.
> Should I consider switching to other technologies?
That depends on what you need in terms of replication and partitioning. You may end up creating master-master synchronization based on RDBMS or something else. In Cassandra, you'll end up with duplicated data between tables and that's perfectly normal for it. You trade disk space in exchange for reading/writing speed.
> how to filter and update a big table dynamically?
If, after all of the above, you still want to use a normalized data model in Cassandra, I suggest you look at secondary indexes first and then move on to custom indexes like a Lucene index.
I have an ERP application with about 50 small lookup tables containing non-transactional data. Examples are ItemTypes, SalesOrderStatuses etc. There are so many different types and categories and statuses and with every new module new lookup tables are being added. I have a service to provide List objects out of these tables. These tables usually contain only two columns, (Id and Description). They have only a couple of rows, 8 - 10 rows at max.
I am thinking about putting all of them in one table with ID, Description and LookupTypeID. With this one table I will be able to get rid of 50 tables. Is it good idea? Bad Idea? Very bad idea?
Are there any standards/best-practices for managing small lookup tables?
Among some professionals, the single common lookup table is considered a design error you should avoid. At the very least, it will slow down performance. The reason is that you will need a compound primary key for the common table, and lookups via a compound key will take longer than lookups via a simple key.
According to Anith Sen, this is the first of five design errors you should avoid. See this article: Five Simple Design Errors
Merging lookup tables is a bad idea if you care about integrity of your data (and you should!):
It would allow "client" tables to reference the data they were not meant to reference. E.g. the DBMS will not protect you from referencing SalesOrderStatuses where only ItemTypes should be allowed - they are now in the same table and you cannot (easily) separate the corresponding FKs.
It would force all lookup data to share the same columns and types.
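The integrity argument can be demonstrated with a small sketch using Python's sqlite3 module. The Items table, its columns, and the sample rows are assumptions for illustration; SQLite is used only because it is self-contained, and the same principle applies to other DBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE ItemTypes (Id INTEGER PRIMARY KEY, Description TEXT NOT NULL);
    CREATE TABLE SalesOrderStatuses (Id INTEGER PRIMARY KEY, Description TEXT NOT NULL);
    CREATE TABLE Items (
        Id INTEGER PRIMARY KEY,
        ItemTypeId INTEGER NOT NULL REFERENCES ItemTypes(Id)
    );
    INSERT INTO ItemTypes VALUES (1, 'Raw material');
    INSERT INTO SalesOrderStatuses VALUES (7, 'Shipped');
""")

conn.execute("INSERT INTO Items VALUES (1, 1)")  # valid: ItemTypes has Id 1

try:
    # 7 exists only in SalesOrderStatuses, so the FK to ItemTypes rejects it.
    conn.execute("INSERT INTO Items VALUES (2, 7)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("wrong reference rejected:", rejected)
```

With one merged lookup table, both ids would live in the same referenced table and a plain foreign key could not tell them apart.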
Unless you have performance problems due to excessive JOINs, I recommend you stay with your current design.
If you do, then you could consider using natural instead of surrogate keys in the lookup tables. This way, the natural keys get "propagated" through foreign keys to the "client" tables, resulting in less need for JOINing, at the price of increased storage space. For example, instead of having ItemTypes {Id PK, Description AK}, only have ItemTypes {Description PK}, and you no longer have to JOIN with ItemTypes just to get the Description: it was automatically propagated down the FK.
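A minimal sketch of the natural-key variant, again using Python's sqlite3 module. The Items table and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE ItemTypes (Description TEXT PRIMARY KEY);
    CREATE TABLE Items (
        Id INTEGER PRIMARY KEY,
        Name TEXT NOT NULL,
        ItemType TEXT NOT NULL REFERENCES ItemTypes(Description)
    );
    INSERT INTO ItemTypes VALUES ('Raw material'), ('Finished good');
    INSERT INTO Items VALUES (1, 'Steel sheet', 'Raw material');
""")

# No JOIN needed: the description is already stored in the referencing row.
row = conn.execute("SELECT Name, ItemType FROM Items WHERE Id = 1").fetchone()
print(row)
```

The foreign key still restricts Items.ItemType to values present in ItemTypes, so integrity is kept while the JOIN disappears.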
You can store them in a text search (i.e. NoSQL) database like Lucene. They are ridiculously fast.
I have implemented this to great effect. Note though that there is some initial setup to overcome, but not much. Lucene queries on ids are a snap to write.
The "one big lookup table" approach has the problem of allowing for silly values -- for example "color: yellow" for trucks in the inventory when you only have cars with "color: yellow". One Big Lookup Table: Just Say No.
Off-hand, I would go with the natural keys for the lookup tables unless you would have cases like "the 2012 model CX300R was red but the 2010-2011 models CX300R were blue (and model ID also denotes color)".
Traditionally, if you ask a DBA, they will say you should have separate tables. If you ask a programmer, they will say using the single table is easier. (It makes building an Edit Status web page very easy: you just make one page and pass it a different LookupTypeID instead of building lots of similar pages.)
However, now with ORMs, the SQL and code to access different status tables is not really any extra effort.
I have used both methods and both work fine. I must admit using a single status table is easiest. I have done this for small apps and also enterprise apps and have noticed no performance impact.
Finally, the other field I normally like to add to these generic status tables is an OrderBy field, so you can sort the statuses in your UI by something other than the description if needed.
Sounds like a good idea to me. You can have the ID and LookupTypeID as a multi-attribute primary key. You just need to know what all of the different LookupTypeIDs represent and you should be good as gold.
EDIT: As for the standards/best-practices, I honestly don't have an answer for you. I've only had one semester of SQL/database design so I haven't been all too exposed to the matter.
I've just started looking into Amazon's DynamoDB. Obviously the scalability appeals, but I'm trying to get my head out of SQL mode and into no-sql mode. Can this be done (with all the scalability advantages of dynamodb):
Have a load of entries (say 5 - 10 million) indexed by some number. One of the fields in each entry will be a creation date. Is there an effective way for dynamo db to give my web app all the entries created between two dates?
A simpler question: can dynamo db give me all entries in which a field matches a certain number? That is, there'll be another field that is a number, for argument's sake let's say between 0 and 10. Can I ask dynamodb to give me all the entries which have value e.g. 6?
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size?)
many thanks
Is there an effective way for dynamo db to give my web app all the entries created between two dates?
Yup, please have a look at the Primary Key concept within the Amazon DynamoDB Data Model, specifically the Hash and Range Type Primary Key:
In this case, the primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. Amazon DynamoDB builds an unordered hash index on the hash primary key attribute and a sorted range index on the range primary key attribute. [...]
The listed samples cover your use case exactly: the Reply (Id, ReplyDateTime, ...) table uses a primary key of type Hash and Range, with a hash attribute Id and a range attribute ReplyDateTime.
You'll use this via the Query API, see RangeKeyCondition for details and Querying Tables in Amazon DynamoDB for respective examples.
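Why a sorted range key makes a date-range query cheap can be sketched in plain Python. This is an in-memory model, not the Query API itself: the function name query_between, the hash key value, and the sample replies are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# Each hash key maps to a list of (range_key, item) kept sorted by range key.
# ISO 8601 timestamps in the same format sort correctly as plain strings.
table = {
    "thread-1": [
        ("2012-01-05T10:00:00Z", "first reply"),
        ("2012-02-10T09:30:00Z", "second reply"),
        ("2012-03-15T18:45:00Z", "third reply"),
    ],
}


def query_between(table, hash_key, start, end):
    """Return items under one hash key whose range key lies in [start, end]."""
    rows = table.get(hash_key, [])
    keys = [range_key for range_key, _ in rows]
    lo = bisect_left(keys, start)    # first position with key >= start
    hi = bisect_right(keys, end)     # first position with key > end
    return [item for _, item in rows[lo:hi]]


result = query_between(table, "thread-1", "2012-02-01T00:00:00Z", "2012-03-31T23:59:59Z")
print(result)
```

Note that the range condition applies within one hash key value, so the hash key has to be chosen so that the entries you want to range over share it.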
can dynamo db give me all entries in which a field matches a certain number. [...] Can I ask dynamodb to give me all the entries which have value e.g. 6?
This is possible as well, albeit by means of the Scan API only (i.e. it does require reading every item in the table); see ScanFilter for details and Scanning Tables in Amazon DynamoDB for respective examples.
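The cost of filtering on a non-key attribute can be illustrated with a short Python model. It is not the Scan API: the scan function, the score attribute, and the generated items are assumptions for illustration.

```python
# Invented sample items: 'score' is a non-key attribute between 0 and 10.
items = [{"id": n, "score": n % 11} for n in range(1000)]


def scan(items, predicate):
    """Examine every item and keep the ones that satisfy the predicate."""
    examined = 0
    matches = []
    for item in items:
        examined += 1
        if predicate(item):
            matches.append(item)
    return matches, examined


matches, examined = scan(items, lambda item: item["score"] == 6)
print(len(matches), "matches after examining", examined, "items")
```

The number of items examined equals the table size regardless of how few match, which is what makes this approach expensive on a table of millions of entries.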
Do both of these queries need a scan of the entire dataset (which I assume is a problem given the dataset size?)
As mentioned, the first approach works with a Query while the second requires a Scan, and generally a query operation is more efficient than a scan operation. This is good advice to get started with, though the details are more complex and depend on your use case; see the section Scan and Query Performance within the Query and Scan in Amazon DynamoDB overview:
For quicker response times, design your tables in a way that can use
the Query, Get, or BatchGetItem APIs, instead. Or, design your
application to use scan operations in a way that minimizes the impact
on your table's request rate. For more information, see Provisioned Throughput Guidelines in Amazon DynamoDB.
So, as usual when applying NoSQL solutions, you might need to adjust your architecture to accommodate these constraints.
I'm trying to design a database for a product aggregator. Each product has information about where it comes from, what it costs, what type of thing it is, price, color, etc. Users need to be able to search and filter results based on any of those product categories. I also expect to have a large number of users. My initial thought was having one big table with every product in it, with a column for each piece of information and an index on anything I need to be able to search by, but I think this might be inefficient with a lot of users pounding on this one table. My other thought was to organize the database to promote a tree-like navigation of tables, but because you can search by anything I'm not sure how I would organize the tables.
Any thoughts on some good practices?
One table of products - databases are designed to have lots of users pounding on tables.
(from the comments)
You need to model your data. This comes from looking at all the data you have and determining what is related to what (a table is called a relation because all the attributes in a row are related to a candidate key). You haven't really given enough information about the scope of the data (unstructured?) you have on these products and how it varies. Are you going to have difficulties because Shoes have brand, model, size and color, but Desks only have brand, model and finish? All this is going to inform your data model. Typically you have one products table, and other things link to it.
Some of those attributes will be foreign keys to lookup tables; others (price) would be simple scalars. With appropriate indexing you'll be fine. For advanced analytics, consider a dimensionally modeled star schema, but perhaps not for your live transaction system; it depends on what your data flow/workflow/transactions are. Or consider some benefits of its principles in your transactional database. Ralph Kimball is a good source of information on dimensional modeling.
I don't see any need for the tree structure here. You can make do with a single table.
If you insist on a tree structure with a hierarchy, here is an example to get you started.
For text-based search, and for ease of startup and design, I strongly recommend Apache Solr. The Solr API is easy to use (especially with JSON). Databases do text search poorly, so I would instead recommend that you just make sure they respond to primary/unique key queries properly; those are the fields you should index.
One table for the products, and another table for the product category hierarchy (you don't specifically say you have this but "tree-like navigation of tables" makes me think you might).
I can see you might be concerned about over-indexing causing problems if you plan to index almost every column. In that case, it might be best to index on the top 5 or 10 columns you think users are likely to search for, unless it's possible for a user to search on ANY column. In that case you might want to look at building a data warehouse. Maybe you'll want to look into data cubes to see if those will help...?
For hierarchical data, you need a PRODUCT_CATEGORY table looking something like this:
ID
PARENT_ID
NAME
Some sample data:
ID  PARENT_ID  NAME
1   NULL       ROOT
2   1          SOCKS
3   1          HELICOPTER PARTS
4   2          ARGYLE
Some SQL engines (such as Oracle) allow you to write recursive queries to traverse the hierarchy in a single query. In this example, the root of the tree has a PARENT_ID of NULL, but if you don't want this column to be nullable, I've also seen -1 used for the same purposes.
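A recursive query over this table can be sketched as follows. This example uses SQLite's WITH RECURSIVE through Python's sqlite3 module rather than Oracle, purely so that it is self-contained; the column types and the DEPTH column are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCT_CATEGORY (
        ID INTEGER PRIMARY KEY,
        PARENT_ID INTEGER REFERENCES PRODUCT_CATEGORY(ID),
        NAME TEXT NOT NULL
    );
    INSERT INTO PRODUCT_CATEGORY VALUES
        (1, NULL, 'ROOT'),
        (2, 1, 'SOCKS'),
        (3, 1, 'HELICOPTER PARTS'),
        (4, 2, 'ARGYLE');
""")

# Walk down from SOCKS (ID 2) to collect it and all of its descendants.
rows = conn.execute("""
    WITH RECURSIVE subtree(ID, NAME, DEPTH) AS (
        SELECT ID, NAME, 0 FROM PRODUCT_CATEGORY WHERE ID = ?
        UNION ALL
        SELECT c.ID, c.NAME, s.DEPTH + 1
        FROM PRODUCT_CATEGORY AS c
        JOIN subtree AS s ON c.PARENT_ID = s.ID
    )
    SELECT ID, NAME, DEPTH FROM subtree ORDER BY DEPTH, ID
""", (2,)).fetchall()
print(rows)
```

Starting the same query from ID 1 would return the whole tree, with DEPTH giving each category's level below the root.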