I am working on a system that will run on GAE and have several related entities, and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience.
The system will have users, each with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purposes of the system, an event will likely have 1 or 2 journal entries, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, and popular entries may have hundreds or even thousands of comments.
When a random visitor uses the system, they should be able to see the latest events (latest being defined as those with the most recent journal entries), search by tag, and perform a very basic text search. Upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside the comments. A user should also have a kind of self-admin page to view/modify/delete their events and to view/modify/delete comments they have made on other events.
On a normal RDBMS, all this would just be queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry, or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
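The three steps above (keys-only query on the index entity, derive the parent key, batch-fetch the parents) can be sketched without the datastore at all. This is only an in-memory illustration: the dictionaries, the tuple-shaped keys, and the helper names `parent` and `search_events` are assumptions made for the example, not the App Engine API.

```python
# Hypothetical in-memory stand-in for the datastore. A key is a tuple path;
# an EventIndex key is its parent Event key plus its own ("EventIndex", id) part.
events = {
    (("Event", 1),): {"name": "Beach cleanup"},
    (("Event", 2),): {"name": "Charity run"},
}
event_indexes = {
    (("Event", 1), ("EventIndex", 1)): {"words": ["beach", "cleanup", "volunteers"]},
    (("Event", 2), ("EventIndex", 1)): {"words": ["charity", "run", "volunteers"]},
}


def parent(key):
    """Derive the parent key by dropping the last path element."""
    return key[:-1]


def search_events(term):
    # Step 1: stands in for a keys-only query on EventIndex. Only keys are
    # kept; the word lists themselves are never returned to the caller.
    index_keys = [key for key, index in event_indexes.items()
                  if term in index["words"]]
    # Step 2: derive the Event keys from the matching index keys.
    event_keys = [parent(key) for key in index_keys]
    # Step 3: batch-fetch the Events by key.
    return [events[key] for key in event_keys]
```

The point of the split is that the Event entities returned in step 3 carry no word list, so displaying search results never pays for deserializing it.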
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
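For anyone unfamiliar with counter sharding, here is a minimal sketch of the idea, assuming a plain dictionary in place of the datastore. The shard count of 20 and the names `increment` and `get_count` are made up for the illustration. Each write picks one shard at random, so concurrent writers rarely touch the same entity; a read sums all shards.

```python
import random

NUM_SHARDS = 20  # assumed value; tune it to the expected write rate

# Hypothetical stand-in for the datastore: (counter name, shard index) -> count.
shards = {}


def increment(name):
    """Add one view by bumping a single randomly chosen shard."""
    shard_key = (name, random.randrange(NUM_SHARDS))
    shards[shard_key] = shards.get(shard_key, 0) + 1


def get_count(name):
    """Total views are the sum over every shard of this counter."""
    return sum(shards.get((name, i), 0) for i in range(NUM_SHARDS))
```

In a real app each shard would be its own small entity updated in its own transaction, and the summed total is a natural candidate for caching in memcache.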
Good luck, and tell us what you learn by trying stuff out.
I am currently exploring MongoDB.
I built a notes web app and for now the DB has 2 collections: notes and users.
The user can create, read and update his notes.
I want to create a page called /my-notes that will display all the notes that belong to the connected user.
My question is:
Should the notes model have an ownerId field, or the opposite: should the user model have a noteIds field of type list?
Points I found relevant for the decision making:
noteIds approach:
There is no need to query the notes collection for the desired ownerId (with a lot of notes, that requires an index and a search across the whole notes collection). We just need to find the user by user ID and then get all the notes by their IDs.
In this case there are 2 calls to DB.
The data is ordered by the order of insertion into the noteIds field in the document.
ownerId approach:
We do need to find the notes by their ownerId field across the notes collection, which might be more compute-intensive.
We can paginate / sort the data as we want - more control over the data.
Are there any more points you can think of?
As far as I can tell, this is a question of whether you want less compute-intensive DB calls or more control over the data.
What are the "best practices"?
Thanks,
A similar use case is explained in the documentation. If there is no limit on number of notes a user can have, it might be better to store a userId reference field in notes document.
As you've figured out already, pagination would be easier in the second approach. Also, when updating a note, you can simply call updateOne with a filter of { _id: "note_id", userId: 1 } instead of checking the user's document to see if the note actually belongs to the user.
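To make the two benefits concrete, here is a small sketch that uses a plain Python list in place of the notes collection, so it runs without a MongoDB server. The field names `createdAt` and `text` and the helpers `notes_page` and `update_note` are assumptions for the example, not driver calls. The first helper mirrors a find on userId with sort, skip and limit; the second mirrors an update whose filter contains both the note ID and the user ID.

```python
# Hypothetical in-memory stand-in for the notes collection.
notes = [
    {"_id": 1, "userId": 1, "text": "buy milk", "createdAt": 10},
    {"_id": 2, "userId": 2, "text": "call Bob", "createdAt": 20},
    {"_id": 3, "userId": 1, "text": "pay rent", "createdAt": 30},
    {"_id": 4, "userId": 1, "text": "book flight", "createdAt": 40},
]


def notes_page(user_id, page, page_size):
    """Newest-first page of one user's notes (filter, sort, skip, limit)."""
    owned = sorted((n for n in notes if n["userId"] == user_id),
                   key=lambda n: n["createdAt"], reverse=True)
    start = page * page_size
    return owned[start:start + page_size]


def update_note(note_id, user_id, changes):
    """Update only if the note exists AND belongs to the user.

    Returns the number of modified notes (0 or 1), so the ownership
    check and the update happen in a single step.
    """
    for note in notes:
        if note["_id"] == note_id and note["userId"] == user_id:
            note.update(changes)
            return 1
    return 0
```

With a real collection, an index on userId (possibly compound with the sort field) is what keeps the lookup from scanning every note.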
I am new to DynamoDB and after reading several docs, there is a scenario in which I am not sure which would be the best approach for designing a table.
Consider that we have some JobOffers and we should support the following data access:
get a specific JobOffer
get all JobOffers from a specific Company sorted by different criteria (newest/oldest/wage)
get JobOffers from a specific Company filtered by a specific city sorted by different criteria (newest/oldest/wage)
get all JobOffers (regardless of any Company !!!) sorted by different criteria (newest/oldest/wage)
get JobOffers (regardless of any Company !!!) filtered by a specific city sorted by different criteria (newest/oldest/wage)
Since we need to support sorting, my understanding is that we should use Query instead of Scan. In order to use Query, we need to use a primary key. Because we need to support a search like "get all JobOffers without any filters sorted somehow", what would be a good candidate for the partition key?
As a workaround, I was thinking of using a new attribute "country" as the partition key, but since all JobOffers are currently in one country, all these items would fall under the same partition, so it might be a bit redundant until we support JobOffers from different countries.
Any suggestion on how to design JobOffer table (with PK and GSI/LSI) for such a scenario?
Design of a DynamoDB table is best done with an access-pattern approach, that is: how are you going to be accessing the data in it? You have information X, you need Y.
Also remember that DynamoDB is NOT SQL, and items are not required to all have the same shape. Consider each entry a document, with its PK/SK acting almost like a directory/item structure in a file system.
So for example:
You have user data. You have things like: Avatar data (image name, image size, image type), Login data (salt/pepper hashes, email address, username), and Post history (post title, identifier, content, replies). Each user will only have 1 Avatar item and 1 Login item, but many Post items.
You know that from the user page you are always going to have the user ID. 100% of the time. This should then be your PK (your Hash Key, Partition Key). Then the rest of the things you need inform your sort key/range key.
PK: USER#123456, SK: AVATAR - Attributes: (image name, image size, image type)
PK: USER#123456, SK: LOGIN - Attributes: (salt/pepper hashes, email address, username)
PK: USER#123456, SK: POST#123 - Attributes: (post title, identifier, content, replies)
PK: USER#123456, SK: POST#125 - Attributes: (post title, identifier, content, replies)
PK: USER#123456, SK: POST#193 - Attributes: (post title, identifier, content, replies)
This way you can do a query with the User ID and get ALL the user data back. Or if you are on a page that just displays their post history, you can do a query with PK = User ID and SK begins with POST and get back all their posts.
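The two queries just described can be sketched as follows. This is an in-memory imitation, not the DynamoDB API: the `items` list and the `query` helper are assumptions for the example, and the prefix match stands in for a begins_with condition on the sort key.

```python
# Hypothetical items, all under one partition key, told apart by sort key.
items = [
    {"PK": "USER#123456", "SK": "POST#193", "title": "Third post"},
    {"PK": "USER#123456", "SK": "AVATAR", "image_name": "me.png"},
    {"PK": "USER#123456", "SK": "POST#123", "title": "First post"},
    {"PK": "USER#123456", "SK": "LOGIN", "username": "someone"},
    {"PK": "USER#123456", "SK": "POST#125", "title": "Second post"},
    {"PK": "USER#999999", "SK": "AVATAR", "image_name": "other.png"},
]


def query(pk, sk_prefix=""):
    """Return the items of one partition whose sort key starts with sk_prefix.

    Results come back ordered by sort key, as a Query on a table would.
    """
    matches = [item for item in items
               if item["PK"] == pk and item["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda item: item["SK"])


everything = query("USER#123456")          # avatar, login and all posts
posts_only = query("USER#123456", "POST#")  # just the post history
```

One call with the user ID returns the whole profile; adding the prefix narrows it to a single item type without a second table.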
You can put in an inverted index (SK -> PK and vice versa) and then do a query on POST#193 and get back the user ID. Or if you have other PK types with POST#193 as the SK, you get more information there (like a REPLIES#193 PK or something)
The idea here is that you have X bits of information, and you need to craft your table so that you can retrieve as much as possible with just that information; by using prefixes on your SKs you can then narrow the results a little.
Note!
Sometimes this means a duplication of information! You may have the same information under two sets of keys. This is OK and kind of expected when you start getting into really complex relationships. You can mitigate it somewhat with indexes, but you should aim to avoid them where possible, as they introduce a bit of lag in data propagation (it's tiny, but it can add up).
So you have your list of things you want to get for your dynamo. What will you always have to tie them together? What piece of data do you have that will work?
You can do the first 3 with a company PK identifier and a reverse index. That will let you look up all of a company's jobs or, using the reverse index, a specific job. Or if you always know the company when looking up a specific job, you can use the main index for that too.
Company# - Job# - data data data
You then do the sorting on your own, OR you add some sort of sort value to the Job# key. Sort keys are inherently sorted, after all. Company# - Job#1234#UNITED_STATES
Of course, this will only work for one sort order at a time. You can make more than one index, but again, data sync lag is a real possibility.
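As a rough illustration of putting a sort value into the key: sort keys compare as strings, so the value has to come before the job ID and has to order correctly as text. The key layout below (JOB#, then an ISO date, then the job ID) and the helper `job_sort_key` are assumptions for this sketch; the same trick works for wage if the number is zero-padded to a fixed width.

```python
def job_sort_key(posted_on, job_id):
    """Build a sort key that orders jobs by posting date.

    posted_on must be an ISO date string (YYYY-MM-DD), because ISO dates
    sort chronologically when compared as plain text.
    """
    return f"JOB#{posted_on}#{job_id}"


# Hypothetical jobs of one company: (posting date, job ID).
jobs = [("2023-03-15", "321"), ("2022-11-02", "456"), ("2023-07-30", "1234")]

# A query returns items in sort-key order; sorting the strings imitates that.
oldest_first = sorted(job_sort_key(date, job_id) for date, job_id in jobs)
newest_first = list(reversed(oldest_first))
```

Because the order is baked into the key, "newest" and "oldest" are the same query read forwards or backwards, but sorting by wage would need a second key or index.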
But how to do this regardless of Company? Well, you can have another index with your searchable attribute (Country, for example) as the PK, and then you can query that.
Or do you have another set of data that can tie this all together? Do you have another thing that can reach it all?
If not, you may just have two items in your dynamo:
Company#1234 - Job#321 - details
Company#1234 - Country#United_states - job#321, job#456, job#1234
Company#1234 - Country#England - job#992, job#123, job#19231
Your reverse index would apply here: you could do a query on PK: Country#United_states and you'd get back:
Country#United_states - Company#1234 - job#321, job#456, job#1234
Country#United_states - Company#4556
Country#United_states - Company#8322
This isn't a relational database, however! So you have to do one of two things: either use those job#s to then query that company and filter the jobs by what you want (bad, since we are trying to avoid multiple queries!), or make each job# an attribute on the country SKs, containing a copy of the relevant data in a map format {job title, job#, country, company, salary}. Then when they click on that job to go to the details, it makes a direct call straight to the job query, gets the details to display, and it's good.
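A sketch of the second option, with a dictionary keyed by (PK, SK) standing in for the table. The job titles, salaries, and the helper names `list_jobs_in_country` and `get_job_details` are invented for the illustration. The country items carry a copy of the summary fields, so the listing page needs one query, and the full item is fetched only when a job is opened.

```python
# Hypothetical table: (PK, SK) -> attributes.
table = {
    ("COMPANY#1234", "JOB#321"): {
        "title": "Backend Engineer", "salary": 90000,
        "country": "United_States",
        "description": "Full job description goes here",
    },
    ("COUNTRY#United_States", "COMPANY#1234"): {
        # Denormalized copies of just the fields the listing page shows.
        "jobs": [
            {"job": "321", "title": "Backend Engineer", "salary": 90000},
            {"job": "456", "title": "Data Analyst", "salary": 70000},
        ],
    },
    ("COUNTRY#United_States", "COMPANY#4556"): {
        "jobs": [{"job": "77", "title": "Designer", "salary": 65000}],
    },
}


def list_jobs_in_country(country):
    """One pass over the country partition yields every job summary."""
    pk = f"COUNTRY#{country}"
    listing = []
    for (item_pk, item_sk), item in table.items():
        if item_pk == pk:
            company = item_sk.split("#", 1)[1]
            for summary in item["jobs"]:
                listing.append({"company": company, **summary})
    return listing


def get_job_details(company, job):
    """Direct key lookup, used only when a user opens one job."""
    return table[(f"COMPANY#{company}", f"JOB#{job}")]
```

The cost of this layout is that every change to a job's title or salary has to be written in two places.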
Again, it all comes down to access patterns. What do you have, and how can you arrange it in a way that lets you get what you need fast?
I am making a web application in which users can post and read articles. I want to show list of articles to a user (which is easy to do) but I want to show only those articles which a user has not read (the article gets marked as read when user opens it).
What type of database should I use to maintain such a relationship, considering that there could be 1000s of articles and 1000s of users? Considering a traditional RDBMS, say there are two separate tables, one for users (user_id) and another for articles (article_id):
I can't add user_ids against each article_id as an article could be read by 100s or 1000s of users.
a user could have read 10s or 100s of articles. I can't add article_id for each user to keep track of which article a user has read
In my opinion, both the above approaches could slow/complicate the process of fetching articles
Use an RDBMS unless you have a reason not to.
"I can't add article_id for each user to keep track of which article a user has read"
Yes, you absolutely can, and that's what you should do; a table like (pseudo-SQL)
CREATE TABLE user_read_articles (
user_id,
article_id
)
with a unique index over (user_id, article_id) is exactly the thing here.
You can also add extra data such as a timestamp for when the user read the article should you need to.
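Here is a runnable sketch of that table using Python's built-in sqlite3 module. The column types, the `users` and `articles` tables, the `read_at` column, and the helper names are assumptions added for the example; `INSERT OR IGNORE` is SQLite syntax, and other databases spell the same idea differently. The unread list is simply every article with no matching row for the user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY);
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE user_read_articles (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    article_id INTEGER NOT NULL REFERENCES articles(article_id),
    read_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, article_id)
);
""")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(10, "First"), (11, "Second"), (12, "Third")])


def mark_read(user_id, article_id):
    # The unique index makes a repeated open of the same article a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO user_read_articles (user_id, article_id) "
        "VALUES (?, ?)", (user_id, article_id))


def unread_articles(user_id):
    # Articles for which this user has no row in user_read_articles.
    rows = conn.execute("""
        SELECT a.article_id, a.title
        FROM articles a
        WHERE NOT EXISTS (
            SELECT 1 FROM user_read_articles r
            WHERE r.user_id = ? AND r.article_id = a.article_id)
        ORDER BY a.article_id
    """, (user_id,))
    return rows.fetchall()
```

With thousands of users and articles the table holds at most a few million narrow rows, which the unique index handles comfortably.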
We've built an algorithm that helps us deliver relevant articles to our users. In the background during certain intervals, the algorithm will calculate metadata, such as average age, age spread, and gender coefficient from a slew of data related to views, comments, and votes.
With that said, are there any downsides to storing this metadata as fields on the Articles table? Or, should I create a separate table, such as Article_Data, to store the information? I am just not sure how much the updating of this metadata will interfere with selecting the articles.
For the most part, we will be SELECTing articles and their metadata and JOINing them on user data (age, gender, etc.) to show users relevant content. The only time we don't need the metadata is when we show a particular article to a user.
If the fields are clearly defined, and there are a limited number of them, put them in the Articles table.
If you are going to store more than one record of metadata fields per article, you need another table, in a one-to-many relationship with the Articles table.
If the fields are not clearly defined, are user-defined, or there are many of them, you probably need a new table with one row per metadata item. But this is more difficult to work with in the long run.
See Also
http://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
I'm currently in the planning phase of building a scheduling web app (for volunteer staffing of events), and I've got a question for those with more experience.
Background:
There's a calendar of events, and any user at any time can register for any of the events. At a later time, but before that event, one of the admins will step in and select a "Staff List" out of those that registered, and the rest will be put into an "Alternate List".
What I've been thinking so far is that there will be an Event table, a User table, and then three others:
UserEvent: Maps users to events they registered for. Does not imply membership in either the Staff or the Alt list.
UserStaff: Maps users to events they registered for and also happen to be staffing.
UserAlt: Similar to UserStaff.
The question then has two parts:
Is this a good way to do it?
Should each of those three associative tables have the user id and the event id?
That second question is really the one I'd like to see discussed. That seems like a lot of duplicated material (everything in either UserStaff or UserAlt will always be in UserEvent), so I was thinking of creating a unique key for the UserEvent table, in addition to the composite key, that the other tables (UserStaff and UserAlt) would refer to. On the plus side, there is less duplicated content; on the downside, there's an intermediary table (UserEvent) that needs to be referenced in almost every query.
Hopefully I've been clear enough, and thanks in advance.
I would have the following tables:
User (UserID, firstname, lastname, etc.)
Event (EventID, Name, Date, Location, Capacity, etc.)
EventRegistration (EventRegistrationID, UserID, EventID, ParticipantTypeID, etc.)
ParticipantType (ParticipantTypeID, Name)
ParticipantType.Name is one of "participant" or "staff".
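A runnable sketch of these tables with Python's sqlite3 module, in case it helps to see the staff selection as a single UPDATE. The column types, the sample rows, the UNIQUE constraint on (UserID, EventID), and the helper names are assumptions for the example; only the table and column names come from the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (UserID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Event (EventID INTEGER PRIMARY KEY, Name TEXT, Date TEXT);
CREATE TABLE ParticipantType (
    ParticipantTypeID INTEGER PRIMARY KEY, Name TEXT UNIQUE);
CREATE TABLE EventRegistration (
    EventRegistrationID INTEGER PRIMARY KEY,
    UserID INTEGER NOT NULL REFERENCES User(UserID),
    EventID INTEGER NOT NULL REFERENCES Event(EventID),
    ParticipantTypeID INTEGER NOT NULL
        REFERENCES ParticipantType(ParticipantTypeID),
    UNIQUE (UserID, EventID)  -- assumed: one registration per user per event
);
INSERT INTO ParticipantType VALUES (1, 'participant'), (2, 'staff');
INSERT INTO User VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
INSERT INTO Event VALUES (1, 'Spring Fair', '2024-05-01');
INSERT INTO EventRegistration (UserID, EventID, ParticipantTypeID)
    VALUES (1, 1, 1), (2, 1, 1);
""")


def promote_to_staff(user_id, event_id):
    """An admin picks a registered user for the staff list."""
    conn.execute("""
        UPDATE EventRegistration
        SET ParticipantTypeID = (SELECT ParticipantTypeID
                                 FROM ParticipantType WHERE Name = 'staff')
        WHERE UserID = ? AND EventID = ?
    """, (user_id, event_id))


def staff_list(event_id):
    """First names of everyone staffing the event."""
    rows = conn.execute("""
        SELECT u.FirstName
        FROM EventRegistration r
        JOIN User u ON u.UserID = r.UserID
        JOIN ParticipantType p ON p.ParticipantTypeID = r.ParticipantTypeID
        WHERE r.EventID = ? AND p.Name = 'staff'
        ORDER BY u.UserID
    """, (event_id,))
    return [row[0] for row in rows]
```

Moving someone between lists never inserts or deletes a row, so there is no second table to keep in sync with the registrations.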
This seems good, although you might want to consider combining your User - Event association tables into one, with a column on that table that indicates the purpose of the association, i.e. Event, Staff, or Alt. This would effectively remove the duplication you describe across the three association tables, since a Staff or Alt row already implies that the user registered for the Event.
One benefit of this approach is that it allows for there to be multiple types of User - Event associations, such as if you have a User who is a Staffer for an Event but not a Participant, or a User who is just an Alt; this approach saves you from having to enumerate all the possible combinations. Now, if your design explicitly specifies that you can only have a certain set of User Participation types, this might introduce a level of dissociation you don't want; you may prefer to have explicit constraints on the set of participation levels that a User may have on an Event. If you don't have that tightly specified set, on the other hand, this system allows for adding more Participation roles easily (and without disturbing existing Participation roles).
Not a direct answer to your question, but here's a site I like. It's got tons (and tons) of sample schema. I generally don't use it as definitive (of course), but sometimes it will give me an idea on something I wasn't thinking of.