I am new to DynamoDB and I am confused about how my tables should look.
I have read the best-practices guide here (recommended for anyone who hasn't read it yet):
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/BestPractices.html
Now I have some dilemmas that I think everyone who starts using DynamoDB will have.
First,
my tables: STUDENTS, TEAMS, PROJECTS
STUDENTS: id, age ...
TEAMS: id, student-1-id, student-2-id, current-project, prev-project, last-updated-on
PROJECTS: id, team-id, list of questions, list student1answers, list student2answers
some comments:
as you can see, I don't use a range key. Do I need one?
each answer is a JSON object of (question number, text, date inserted)
every student can be in multiple teams.
My dilemmas:
I want to get all the teams of a specific student that were updated after a specific date.
For now I am using two scan operations: one searches on student-1-id and the second on student-2-id.
**Is there a better way?**
I have thought about adding a new table: user-Battles: student-id, team-id
so I can query the teams for a specific student and then batch_get_item all those teams.
But what about last-updated-on? How can I also filter by it inside the batch_get_item?
When a project is over I don't use it anymore. What should I do with the old items?
Delete them? Move them to another table?
In the PROJECTS table, the only attributes that get updated are the answer attributes,
so I am thinking of moving them to another table for performance.
Do I really need to move them if they are updated just twice (once when student 1 sends an answer and once when student 2 sends an answer, after which the project is old)?
If I create a new table for the answers, I will not have to store them in JSON format.
How would you design the tables? Please let me know.
Nice question with lots of detail :)
If I had only one piece of advice, it would be:
keep in mind that, with NoSQL, it is not only OK but normal, even recommended to de-normalize your data.
That said, for your "dilemma", your suggestion was pretty good. You should de-normalize, with the date as the range_key. One way could be to add a table like this:
hash_key: student
range_key: date
team: team_id
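With that layout, the original access pattern ("all the teams of a specific student updated after a specific date") becomes a single Query. A minimal sketch with boto3 (the table name, attribute names, and values are assumptions for illustration):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("StudentTeams")  # hash_key: student, range_key: date

# all teams of one student updated after a given date, in a single Query
resp = table.query(
    KeyConditionExpression=Key("student").eq("student-42") & Key("date").gt("2012-06-01")
)
team_ids = [item["team"] for item in resp["Items"]]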
Still, this is not perfect, as the table would keep growing: each update inserts a new item, because it is not possible to edit a key. You would have to write your own clean-up code.
In DynamoDB, you do not have to worry about performance slowdowns caused by "old" items (except for scans); this is the main strength of DynamoDB. Nonetheless, it is always good practice to keep data clean, but be consistent: if you start moving expired projects, move all of them, or you will end up not knowing where your data is.
Last suggestion: are you sure "ids" are the best way to describe your objects? Most of the time, a name, date, or other unique attribute makes a better key.
There is a Book table, where the combination of title, edition, and author is always unique.
I want all bookstores to add their books, but different bookstores can have the same book with different pricing. So I came up with this table design.
So when one bookstore tries to add a book that has already been added by another bookstore, the current bookstore should only have to fill in the pricing details, not the book details.
The problem with this is: what if the book details that were already added have missing or incorrect info? In that case, the current bookstore can flag it and moderators or admins can fix it.
Is there a better way to achieve this? I'm not comfortable with this design logic at all.
Your design makes sense. You want to keep the "static" information in one table and link the "dynamic" information the way you did.
Your other question is related to data integrity. You can put "not null" constraints on fields to ensure all fields are filled, but garbage entries are always possible. This is a universal problem.
Potential solutions to mitigate this:
any and all data that can be selected instead of typed in should be linked via another table. Ex:
BookGenre
bookgenreid PK
genre CHAR
Book
bookid PK
genre FK, BookGenre.bookgenreid
...
So you store all possible genres in a separate table, so your users cannot invent new genres or mistype values. The same goes for authors, countries, and so on. This also makes it easier to build queries and avoids things like [SciFi, Science Fiction, Sciance fiction, ...]. (A DDL sketch of this pattern is at the end of this answer.)
not everyone should be able to enter new books into the system. For example, when I worked at a wholesale distributor, only a select group of employees could create new products in the database, and they had established a convention for how to do it. They worked closely with purchasing and receiving. You will need to dedicate "data administrators".
So try to control as much as you can in the database and/or the application. Avoid free-text fields as much as possible, as users will always think of new ways to mess them up. For example, at work we currently have a HUGE project to standardise addresses between unlinked systems. It is an enormous undertaking, which involves AI, all because no two people enter addresses exactly the same way.
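To make the lookup-table idea above concrete, here is a rough sketch using Python's sqlite3 module (table and column names follow the genre example; treat it as an illustration, not a finished schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this to enforce FKs
conn.executescript("""
CREATE TABLE BookGenre (
    bookgenreid INTEGER PRIMARY KEY,
    genre       TEXT NOT NULL UNIQUE          -- the only place genres are ever typed in
);
CREATE TABLE Book (
    bookid      INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    genre       INTEGER NOT NULL REFERENCES BookGenre(bookgenreid)
);
""")
# Users pick an existing genre instead of typing free text, so values like
# 'Sciance fiction' can never get into the Book table.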
New to this community. I need some help designing the Amazon DynamoDB table for my personal project.
Overview: this is a simple photo gallery application with the following attributes.
UserID
PostID
S3URL
Caption
Likes
Reports
UploadTime
I wish to perform the following queries:
For a given user, fetch 'N' most recent posts
For a given user, fetch 'N' most liked posts
Give 'N' most recent posts (Newsfeed)
Give 'N' most liked posts (Newsfeed)
My solution:
Keeping UserID as the partition key, PostID as the sort key, and Likes and UploadTime as local secondary indexes, I can solve the first two queries.
I'm confused about how to perform the query operation for 3 and 4 (Newsfeed). I know that without a partition key I cannot query, and a scan is not an effective solution. Any workaround for operations 3 and 4?
Any ideas on how I should design my DB?
It looks like you're off to a great start with your current design, well done!
For access pattern #3, you want to fetch the most recent posts. One way to approach this is to create a global secondary index (GSI) to aggregate posts by their creation time. For example, you could create an attribute named GSI1PK on your main table, assign it a value of POSTS, and use the upload_time field as the sort key. Viewed through the secondary index (I've named it GSI1), all of your posts end up in a single POSTS partition, sorted by upload_time.
This would allow you to query for Posts and sort by upload_time. This is a great start. However, your POSTS partition will grow quite large over time. Instead of choosing POSTS as the partition key for your secondary index, consider using a truncated timestamp to group posts by date. For example, here's how you could store posts by the month they were created:
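Something along these lines (illustrative items only; the attribute values are made up):

posts = [
    {"UserID": "u1", "PostID": "p101", "UploadTime": "2021-01-15T18:40:00Z", "GSI1PK": "POSTS#2021-01-00"},
    {"UserID": "u2", "PostID": "p102", "UploadTime": "2021-01-14T09:12:00Z", "GSI1PK": "POSTS#2021-01-00"},
    {"UserID": "u1", "PostID": "p087", "UploadTime": "2020-12-30T22:05:00Z", "GSI1PK": "POSTS#2020-12-00"},
]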
Storing posts using a truncated timestamp will help you distribute your data across partitions, which will help your DB scale. If a month is too long, you could use truncated timestamps for a week/day/hour/etc. Whatever makes sense.
To fetch the N most recent posts, you'd simply query your secondary index for POSTS in the current month (e.g. POSTS#2021-01-00). If you don't get enough results, run the same query against the prior month (e.g. POSTS#2020-12-00). Keep doing this until your application has enough posts to show the client.
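A rough boto3 sketch of that paging pattern (the table, index, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Posts")

def most_recent_posts(n, partitions=("POSTS#2021-01-00", "POSTS#2020-12-00")):
    """Walk backwards, month by month, until we have n posts."""
    items = []
    for partition in partitions:
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(partition),
            ScanIndexForward=False,   # newest first within the month
            Limit=n - len(items),
        )
        items.extend(resp["Items"])
        if len(items) >= n:
            break
    return items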
For the fourth access pattern, you'd like to fetch the most liked posts. One way to implement this access pattern is to define another GSI with "LIKES" as the partition key and the number of likes as the sort key.
If you intend to introduce a date range on the number of likes (e.g. most popular posts this week/month/year/etc.), you could utilize the truncated timestamp approach I outlined for the previous access pattern.
When you find yourself with "fetch most recent" access patterns, you may want to check out KSUIDs. KSUIDs, or K-Sortable Unique Identifiers, are unique identifiers that are sortable by their creation date/time. Think of them as UUIDs and timestamps combined into one attribute. This could be useful in supporting your first access pattern, where you are fetching the most recent posts for a user. If you were to use a KSUID for the Post ID, your table would look something like this:
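Something like this (illustrative items; the KSUID values are made up):

posts = [
    {"UserID": "u1", "PostID": "1tOWxlLBL5kWQmE8mv9aZ6wZq7m", "Caption": "sunset"},  # PostID is now a KSUID
    {"UserID": "u1", "PostID": "1sRqY0PpXz4GQk3vN8dTq5bH2cA", "Caption": "coffee"},
]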
I've replaced the Post IDs in this example with KSUIDs. Because KSUIDs are unique and sortable by the time they were created, you are able to support your first access pattern without any additional indexing.
There are KSUID libraries for most popular programming languages, so implementing this feature is pretty simple.
You could add two Global Secondary Indexes.
For 3):
Create a static attribute type with the value post, which serves as the partition key for the GSI, and use the attribute UploadTime as the sort key. You can then query for type="post" and get the most recent items based on the sort key.
The solution for 4) is very similar:
Create another global secondary index with the aforementioned item type as the partition key and Likes as the sort key. You can then query in a similar way as above. Note that GSIs are eventually consistent, so it may take some time until your like counters are updated in the index.
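A rough boto3 sketch of that query (the table, index, and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Posts")

resp = table.query(
    IndexName="GSI2",                       # partition key: type, sort key: Likes
    KeyConditionExpression=Key("type").eq("post"),
    ScanIndexForward=False,                 # highest like count first
    Limit=10,
)
most_liked = resp["Items"]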
Explanation and additional info
Using this approach you group all posts in a single item collection, which allows for efficient queries. To save on storage space and RCUs, you can also choose to only project a subset of attributes into the index.
If you have more than 10 GB of post data, this design isn't ideal, but for a smaller application it will work fine.
If you're going for a single-table design, I'd recommend using generic names for the index attributes: PK, SK, GSI1PK, GSI1SK, GSI2PK, GSI2SK. You can then duplicate the attribute values into these attributes. This will make it less confusing if you store different entities in the table. Adding a type column that holds the entity type is also common.
I have a system where users can vote on entities, indicating whether they like or hate them. There will be bazillions of votes and trazillions of records, hopefully, some time in the future :)
At the moment I store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the "UserRecordVote" table for all the "likes", then I take the recordIds from that result set, create a key from that property, and get the record from the Record table.
Then I aggregate all of that into a list and return it.
Here's the question:
I came up with a different approach and I want to find out (1) whether it is faster and (2) how big the difference in cost is.
I would create an Entity whose kind would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I do a query to get all the likes, I could simply query for everything, no filters needed. AND I could just grab the entity keys, which would be much cheaper if I remember the App Engine documentation right (I can't find the pricing page anymore). Then I could take the Iterable of keys and do a single get(Iterable keys). OK, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user, or better said, all the records a user hasn't voted on yet?
Here's the real question:
I want to load all the records a user hasn't voted on yet.
So I would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all the record entity keys matching one of the vote entity keys, and with the result I would get(Iterable keys) the full entities, leaving me with all the record entities that are not in either of the two voting tables.
Is that a useful approach? Is it the fastest and most cost-efficient way to do a datastore query? Am I totally wrong, and should I store the information as list properties instead?
EDIT:
With that approach I would have two entity kinds for each user, which would result in millions of different kinds. How would the GAE Datastore handle that? At the very least the Datastore Viewer's entity select box would probably crash :)
To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
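A rough sketch with the (old) google.appengine.ext.db API; the property names and integer encoding are assumptions, and it assumes a UserRecordVote row exists for every user/record pair:

from google.appengine.ext import db

NOT_VOTED, LIKED, HATED = 0, 1, 2   # assumed encoding

class UserRecordVote(db.Model):
    userId = db.StringProperty()
    recordId = db.StringProperty()
    hateOrLike = db.IntegerProperty(default=NOT_VOTED)

# all records this user has not voted on yet
not_voted = (UserRecordVote.all()
             .filter('userId =', 'user-123')
             .filter('hateOrLike =', NOT_VOTED)
             .fetch(1000))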
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is that since you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating them every time you pull up a UserRecord. Querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query, so if you have more than 1000 votes you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples of that, with code, available if you do a Google search.
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)
hates = len(rec.hates.split('_'))
etc.
My spider sense is tingling, but I've been thinking about it for 2 hours now and I'd like some more feedback from the hivemind.
I'm creating an application for a school. It's supposed to handle students, teachers, courses, honor rolls, grades - the works.
I was wondering how to handle the change of school years each year.
Students move up a grade (or don't).
Teachers are assigned to different grades as their homeroom teacher.
Grades are saved for the year.
There's also the matter of auditing. I need an easy way to pull up records from last year or the year before, and see which teacher taught which course at which grade in which year.
The problem I'm having is how to handle this.
My thought was to create a new clean database for each year as they come along. So at the end of this year, I'd go to the school, create a new database for them named FooSchool2012, and programmatically let the end users change the database they want to use via a connection string.
Since I'm using an ORM it's only a matter of changing the connection string as the databases are the same.
But this reeks of bad design and crappy engineering to me.
Usually my gut is right, so hopefully you guys can let me know of some alternatives on how to handle this.
No, I would not create a new table or database for each year. It breaks first normal form. Every table will be a duplicate except for the name. It's a poor design. And a maintenance headache. Who's going to create the new database, load the schema, and then change all the URLs? If you change the schema after a few years, will you have to change all the back editions as well so people can query the historical data?
Nope, not a good design at all.
It's common to move historical information out into reporting/data warehousing databases. But the scheme you're suggesting is reminiscent of old, mainframe, VSAM flat file methodologies. I'd use relational databases the way they were intended to be used.
I'm sure your solution could be made workable, but it does seem a little needlessly complicated. Couldn't you accomplish the same thing in a single database by referencing the school year? You may want to think about which entities make sense to have "effective dates" (i.e., a start and an end time). The 3rd grade teacher may change mid year, for example, but you could handle that with effective dates.
"My thought was to create a new clean database for each year"
If you thought about this for two hours, and your best idea was to create a new database for each year, you're the wrong person to design this database.
That's an observation, not a criticism. You just need to learn a lot more of the fundamentals before you tackle a project like this one. Otherwise you'll just get frustrated, and the school will suffer.
You need to spend A LOT of time on your database design. Think about maintenance in the long run; it needs to be as easy as possible. The best way is to create a relational database; research bridge, validation, and base tables. To answer your question, I would not make a table for every year. The best way is to have the student grade data mapped to a unique id representing that student's specific course offering.
I would think about creating a table for each of the nouns:
instructor - PK instructorID,instructorName..
(any other 1:1 instructor information)
Student - PK studentID,StudentName..
(any other 1:1 student information)
Course - PK CourseID, CourseName, CourseDescription..
(any other 1:1 course information)
•Teachers are assigned to different grades as their homeroom teacher.
On the 1:1 instructor table you could have a column called HomeroomGrade and then update that column with the current grade. If you wanted to keep a history of the grade, you could give the instructor table a composite key with another column incrementing up for the current record.
•Students move up a grade (or don't).
You will need another table showing the relationship of students to a unique course's grades for that year, but first you need to map the instructor to that specific course.
PK InstructorToCourseID
InstructorID - FK
CourseID - FK
Year - FK
Then yet another table mapping the unique course to that student with the grade:
PK InstructorToCourseID FK from previous table
PK StudentID - FK from student information table
Grade
Sorry if I'm general and vague, but this should give you some ideas about the relationships that can be created (sketched as DDL below).
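To make those relationships concrete, here is a rough DDL sketch using Python's sqlite3 module (names follow the tables above; the unnamed grade table is called StudentCourseGrade here, and Year is left as a plain column):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Instructor (instructorID INTEGER PRIMARY KEY, instructorName TEXT);
CREATE TABLE Student    (studentID    INTEGER PRIMARY KEY, studentName    TEXT);
CREATE TABLE Course     (courseID     INTEGER PRIMARY KEY, courseName TEXT, courseDescription TEXT);

-- which instructor taught which course in which year
CREATE TABLE InstructorToCourse (
    instructorToCourseID INTEGER PRIMARY KEY,
    instructorID INTEGER NOT NULL REFERENCES Instructor(instructorID),
    courseID     INTEGER NOT NULL REFERENCES Course(courseID),
    year         INTEGER NOT NULL
);

-- which student took that course offering, and the grade they received
CREATE TABLE StudentCourseGrade (
    instructorToCourseID INTEGER NOT NULL REFERENCES InstructorToCourse(instructorToCourseID),
    studentID            INTEGER NOT NULL REFERENCES Student(studentID),
    grade                TEXT,
    PRIMARY KEY (instructorToCourseID, studentID)
);
""")
# "What did teacher X teach in 2012, and what grades did the students get?"
# becomes a join filtered on year, instead of a separate database per school year.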
Assume a social application that has a few million users and around 200-300 topics. Users can make posts, which can be tagged with up to 5 topics. I have two kinds of queries on this data:
find posts by a certain user
find all recent posts tagged with a specific topic.
For the 1st query I can easily create the schema using super columns in the User column family (in this super column, I can store the postIds of all posts by the user as columns).
My question is: how should I design the schema to serve the 2nd query in Cassandra?
Although Justice's answer would work, I don't like it because it requires an OrderPreservingPartitioner to perform the range scan. OPP has a lot of problems associated with it. See the article that I've been linking to constantly for details.
Instead, I would recommend this:
topic|YYMMDDHH: {TimeUUID: postID, TimeUUID: postID, etc... }
where "topic|YYMMDDHH" is the row key, each column name is a TimeUUID, and the column values are postIDs.
To get the latest posts for any topic, you get a slice off the end of the most recent row for that topic. If that row didn't have enough columns, you go to the previous one in time, etc.
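A rough sketch of that read path with pycassa (the keyspace and column family names are assumptions, and you would build the hour row keys yourself, newest first):

import pycassa

pool = pycassa.ConnectionPool('SocialApp')
posts_by_topic = pycassa.ColumnFamily(pool, 'TopicPosts')   # row key: "topic|YYMMDDHH"

def latest_post_ids(hour_row_keys, n=20):
    post_ids = []
    for row_key in hour_row_keys:             # e.g. ["cassandra|10112815", "cassandra|10112814", ...]
        try:
            # slice off the end of the row: newest TimeUUID columns first
            cols = posts_by_topic.get(row_key, column_count=n - len(post_ids), column_reversed=True)
        except pycassa.NotFoundException:
            continue                           # no posts in that hour
        post_ids.extend(cols.values())         # column values are post IDs
        if len(post_ids) >= n:
            break
    return post_ids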
This has a few nice properties. First, if you don't care about really old posts on a topic, only relatively recent ones, you can purge old rows on a regular basis and save yourself some space; this could even be done with column TTLs so that you don't have to do any extra work. Second, your rows will be bounded in size because they are split every hour. Third, you don't need OPP :)
One downside to this is that if there's a really hot topic, one node may receive higher traffic than the others for an hour at a time.
For the second query, build a secondary-index column family whose keys are #{topic}:#{unix_timestamp}. Rows would have a single column with the post ID. You can then do a range scan.