I have two models: Members and Events. A member can participate in many events, and an event has many participants. I'm thinking about it like this:
class Members(ndb.Model):
    event = ndb.KeyProperty(repeated=True)

class Events(ndb.Model):
    member = ndb.KeyProperty(repeated=True)
What's the best way to do a many-to-many relationship?
I think in this case you want to have an array/list of keys in one model pointing to the other model. Since there is a limit on the length of the list (the max is 5000 right now), you probably want the member model to point to the event models: I am assuming members probably don't go to many events (more than 5000), but there are many (more than 5000) members at an event. Now if you want the list of attendees, you just query for the event key in the member models. You can do a keys-only query if you just want the head count, which will save on cost.
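For example, with the models above (a sketch; event_key is the key of the event in question):

# members attending a given event; a repeated property matches if any value is equal
attendees = Members.query(Members.event == event_key).fetch()

# keys-only query for a cheap head count
head_count = len(Members.query(Members.event == event_key).fetch(keys_only=True))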
If for some reason you need both sides to be able to exceed 5000, then you can make another (Attendee) model that contains a member key and an event key, and create one such Attendee entity for each guest at each event. To get the list of guests, just query this model for a specific event; if you need a list of events for a member, do the opposite. This isn't much more work than the above, so it might be your go-to way of doing it even if 5000 isn't a limiting factor.
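A minimal sketch of that join model (the Attendee name is illustrative; event_key/member_key are keys you already have):

class Attendee(ndb.Model):
    member = ndb.KeyProperty(kind=Members)
    event = ndb.KeyProperty(kind=Events)

# all guests of one event
guests = Attendee.query(Attendee.event == event_key).fetch()

# all events of one member
events = Attendee.query(Attendee.member == member_key).fetch()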
I am looking for advice on the best way to go about modeling my database.
Let's say I have three entities: meetings, habits, and tasks. Each has its own unique schema; however, I would like all 3 to have several things in common.
They should all contain calendar information, such as a start_date, end_date, recurrence_pattern, etc...
There are a few ways I could go about this:
Add these fields to each of the entities
Create an Event entity and have a foreign_key field on each of the other entities, pointing to the related Event
Create an Event entity and have 3 foreign_key fields on the Event (one for each of the other entities). At any given time only 1 of those fields would have a value and the other 2 would be null
Create an Event entity with 2 fields, related_type and related_id. The related_type value, for any given row, would be one of "meetings", "habits", or "tasks", and the related_id would be the actual id of that entity type.
I will have separate api endpoints that access meetings, habits, and tasks.
I will need to return the event data along with them.
I will also have an endpoint to return all events.
I will need to return the related entity data along with each event.
Option 4 seems to be the most flexible but eliminates working with foreign keys.
I'm not sure if that is a problem or hinders performance.
I say it's flexible in the case that I add a new entity, let's call it "games": the event schema will already be able to handle this.
When creating a new game, I would create a new event, and set the related_type to "games".
I'm thinking the events endpoint can join on the related_type and would also require little to no updating.
Additionally, this seems better than option 3 in the case that I add many new entities that have event data.
For each of these entities a new column would be added to the event.
Options 1 and 2 could work fine; however, I cannot just query for all events, I would have to query each of the other entities.
Is there any best practices around this scenario? Any other approaches?
In the end, performance is more important than flexibility. I would rather update code than sacrifice performance.
I am using django and maybe someone has some tips around this, however, I am really looking for best practices around the database itself and not the api implementation.
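For concreteness, option 4 maps onto Django's contenttypes framework (the only real foreign key is to ContentType, not to the related row). A rough sketch, using the field names from option 4; the calendar field types are assumptions:

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Event(models.Model):
    start_date = models.DateTimeField()
    end_date = models.DateTimeField(null=True, blank=True)
    recurrence_pattern = models.CharField(max_length=255, blank=True)
    # related_type + related_id together resolve to any model instance
    related_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    related_id = models.PositiveIntegerField()
    related = GenericForeignKey("related_type", "related_id")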
I would keep it simple and choose option 1. Splitting up data in more tables than necessary for proper normalization won't be a benefit.
Perhaps you will like the idea of using PostgreSQL table inheritance. You could have an empty table event, and your three tables inherit from that table. That way, they automatically have all columns from event, but they are still three independent tables.
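Since the question mentions Django: option 1 does not require repeating the shared columns by hand there either. An abstract base model gives each concrete table its own copy of the calendar fields, much like table inheritance (a sketch; the per-entity fields and types are assumptions):

from django.db import models

class CalendarFields(models.Model):
    start_date = models.DateTimeField()
    end_date = models.DateTimeField(null=True, blank=True)
    recurrence_pattern = models.CharField(max_length=255, blank=True)

    class Meta:
        abstract = True  # no table is created for the base class

class Meeting(CalendarFields):
    agenda = models.TextField(blank=True)

class Habit(CalendarFields):
    streak = models.IntegerField(default=0)

class Task(CalendarFields):
    done = models.BooleanField(default=False)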
What I have: A Postgres database with TypeORM.
What is the model: A main entity Event and many inherited children (10+). Each child will have attributes very different from the others. Each Event belongs to a User, left out of the example as it is not important for this particular case. For a simplified example, I've attached a mock diagram.
How do I use it: The model will be queried for the list of all aggregated events, in chronological order, and then filtered by "type" in order to be displayed (Parameter, Symptom, ...). This means that I want a list with a subset of the most recent 20 Events (as an example). For each of those 20, I will get the individual data from its table (or embed it in the first place).
Some of the events will happen at a high frequency while others at a much lower frequency.
The question is, what would be the best approach to model this?
I came up with:
Single table, Event contains everything
Cons: it will contain a lot of NULLs
Pros: everything can be aggregated through that table, no joins or views needed
Multiple tables, one for each type
Cons: I will need to have lots of joins and views to aggregate the data
Pros: Each table and entry is meaningful of its data and type
Multiple tables plus a main Event which points to the type's row
Cons: almost the same as above
Pros: same as above, plus an easier way to get certain aggregations per User
It looks like you have a straightforward tree structure rooted at the User entity: User → Event → Child.
You then just need a table for each entity:
Users
Events
The Children (Parameter, Symptom, etc.)
The Events have a user field that contains the ID of their parent User. Similarly, the Children have an event field that contains the ID of their parent Event.
Such that:
"Child" {
event -> id of parent Event
}
Event {
user -> id of parent User
}
In TS, you can create Entities for each type of Child (Symptom, Meal, etc.) so they remain strictly typed. Having a separate table for each allows you to optimize their DB representations however you want (indexing specific fields, normalizing, etc.). Separate tables also allow you to avoid having a lot of NULLs for uncommon values.
I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.
For example, my app needs to keep track of customers and all their purchases.
Coming from a relational database background, I can achieve this by creating [Customers] and [Purchases] tables.
In Datastore, I can make [Customers] and [Purchases] kinds.
Where I am struggling is the structure of the [Purchases] kind.
If I make [Purchases] a child of the [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean that inside of this [Purchases] entity, I would have a property that just keeps increasing for each purchase they make?
Or would I have one [Purchases] entity for each purchase they make, and in each of these entities a property that points to an entity in the [Customers] kind?
How does Datastore perform in these scenarios?
Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:
class Customer(ndb.Model):
    # customer data fields
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    customer = ndb.KeyProperty(kind=Customer)
    # purchase data fields
    price = ndb.IntegerProperty()
This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a keyproperty that points to the customer.
If you have a purchase and need to find the associated customer, it's right there:
purchase_entity.customer.get()
If you have a Customer, you can issue a query to find all the purchases that belong to the customer:
Purchase.query(Purchase.customer == customer_entity.key).fetch()
In this case, whenever you write either a customer or purchase entity, the GAE datastore will write that entity to any one of the datastore machines running in the cloud that's not busy. You can have really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back the most current data in the indexes. If a new purchase was added, but the indexes are not updated yet, then you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.
Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as one write per second. As a benefit though, you can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.
The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you'd create it as:
purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it. So you can think of the purchase key being twice as long. However, you don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchase's key.
class Purchase(ndb.Model):
    # you don't need a KeyProperty for the customer anymore
    # purchase data fields
    price = ndb.IntegerProperty()

# to find the customer of a given purchase:
purchase.key.parent().get()
And in order to query for all the purchases of a given customer:
Purchase.query(ancestor=customer_entity.key).fetch()
The actual structure of the entities doesn't change much, mostly the syntax. But the ancestor queries are fully consistent.
The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:
class Purchase(ndb.Model):
# purchase data fields
price = ndb.IntegerProperty()
class Customer(ndb.Model):
purchases = ndb.StructuredProperty(Purchase, repeated=True)
This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.
There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for a limited number of repeated objects, not hundreds or thousands. And each entity is limited to a total size of 1MB, so you'll eventually hit that and won't be able to add more purchases.
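For instance, a query on a nested Purchase property looks like this (a sketch; note it still returns whole Customer entities, not individual purchases):

# customers who have at least one purchase with price == 100
Customer.query(Customer.purchases.price == 100).fetch()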
(from your personal tags I assume you are a java guy, using GAE+java)
First, don't use ancestor relationships - these have a special purpose: defining the transaction scope (aka Entity Groups). They come with several limitations and should not be used for normal relationships between entities.
Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.
In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.
In your case there are two possibilities to create a one-to-many relationship between Customer and its Purchases.
public class Customer {
    @Id
    public Long customerId; // 'Long' identifiers are autogenerated

    // first option: parent-to-children references
    public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}

public class Purchase {
    @Id
    public Long purchaseId;

    // option two: child-to-parent reference
    public Key<Customer> customer;
}
Whether you use option 1 or option 2 (or both) depends on how you plan to access the data. The difference is whether you use get or query; the difference between the two is in cost and speed, get being always faster and cheaper.
Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
I think the best approach in this specific case would be to use a parent structure.
class Customer(ndb.Model):
    pass

class Purchase(ndb.Model):
    pass

customer = Customer()
customer_key = customer.put()

purchase = Purchase(parent=customer_key)
You could then get all purchases of a customer using
purchases = Purchase.query(ancestor=customer_key).fetch()
or get the customer who bought the purchase using
customer = purchase.key.parent().get()
It might indeed be a good idea to keep track of the purchase count when you use that value a lot.
You could do that using a _pre_put_hook or _post_put_hook
class Customer(ndb.Model):
    count = ndb.IntegerProperty(default=0)

class Purchase(ndb.Model):
    def _post_put_hook(self, future):
        # TODO check whether this is a new entity.
        customer = self.key.parent().get()
        customer.count += 1
        customer.put()
It would also be good practice to do this action in a transaction, so the count update is rolled back when putting the purchase fails, and the other way around.
@ndb.transactional
def save_purchase(purchase):
    purchase.put()
I want to run over this plan I have for achieving strong consistency with my GAE structure. Currently, here's what I have (it's really simple, I promise):
You have a Class (Class meaning Classroom not a programming "class") model and an Assignment model and also a User model. Now, a Class has an integer list property called memberIds, which is an indexed list of User ids. A class also has a string list of Assignment ids.
Anytime a new Assignment is created, its respective Class entity is also updated and adds the new Assignment id to its list.
What I want to do is get new Assignments for a user. What I do is query for all Classes where memberId = currentUserId. Each Class I get back has a list of assignment ids. I use those ids to get their respective Assignments by key. After months with this data model, I just realized that I might not get strong consistency with this (for the Class query part).
If user A posts an assignment (which consequently updates ClassA), user B who checks in for new assignments a fraction of a second later might not yet see the updated changes to ClassA (right?).
This is undesired. One solution would be to use ancestor queries, but that is not possible in my case, and entity groups are limited to 1 write per second, which I'm not sure is enough for my case.
So here's what I figured: anytime a new assignment is posted, we do this:
Get the respective Class entity
add the assignment id to the Class
get the ids of all this Class's members
fetch all users who are members, by keys
a User entity has a list of Classes that the user is a member of (a LocalStructuredProperty, sort of like a dictionary: {"classId" : "242", "hasNewAssignment" : "Yes"}). We flag that class as hasNewAssignment = YES.
Now when a user wants to get new assignments, instead of querying for groups that have a new assignment and which I am a member of, I check the User object's list of Classes and see which classes have new assignments.
I retrieve those classes by key (strongly consistent so far, right?)
I check the assignments list of the classes and I retrieve all assignments by key.
So throughout this process, I've never queried. All results should be strongly consistent, right? Is this a good solution? Am I overcomplicating things? Are my read/write costs skyrocketing with this? What do you think?
Queries are not strongly consistent. Gets are strongly consistent.
I think you did the right thing:
Your access is strongly consistent.
Your reads will be cheaper: one get costs half as much as a query that returns one entity.
Your writes will be more expensive: you also need to update all User entities.
So the cost depends on your usage pattern: how many assignment reads do you have vs new assignment creation.
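For illustration, the get-only read path might look like this in Python NDB terms (property names taken from the question; the whole thing is a sketch, not working code from the asker's app):

user = user_key.get()  # a get by key: strongly consistent
flagged = [c for c in user.classes if c.hasNewAssignment == "Yes"]
class_keys = [ndb.Key("Class", c.classId) for c in flagged]
classes = ndb.get_multi(class_keys)  # batch get: still strongly consistent
assignment_keys = [ndb.Key("Assignment", aid)
                   for cls in classes for aid in cls.assignmentIds]
assignments = ndb.get_multi(assignment_keys)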
I think using ancestor queries is a better solution.
Set the ancestor of the Assignment entities as the Class to which the assignment is allotted
Set the ancestor of a Student entity as the Class to which the student belongs.
This way all the assignments and students in a particular class belong to the same entity group, so strong consistency is guaranteed w.r.t. a query that has to deal with only a single class.
N.B. I am assuming that not too many people will be posting assignments into the same class at the same time (but any number of people can post assignments into different classes, as they belong to different entity groups).
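A sketch of that ancestor layout in Python NDB terms (model names from the question; the created property is an assumption):

class Assignment(ndb.Model):
    title = ndb.StringProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

# parenting the assignment to its class puts both in one entity group
assignment = Assignment(parent=class_key, title="HW 1")
assignment.put()

# ancestor query: strongly consistent within that class
latest = (Assignment.query(ancestor=class_key)
          .order(-Assignment.created)
          .fetch(20))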
I am working on a system, which will run on GAE, which will have several related entities and I am not sure of the best way to store the data. This post is a request for advice from others who may have similar experience....
The system will have users, with profile data and an image. Those users will be able to create "events" and add journal entries to them. For the purpose of the system, the "events" will likely have 1 or 2 journal entries in them, and anything over 10 would likely never happen. Other users will be able to add comments to users' entries as well, where popular ones may have hundreds or even thousands of comments. When a random visitor uses the system, they should be able to see the latest events (latest being defined by those with the latest journal entries in them), search by tag, and perform a very basic text search. Then upon selecting an event to view, it should be displayed with all journal entries and all user comments, with user images alongside comments. A user should also have a kind of self-admin page, to view/modify/delete their events and to view/modify/delete comments they have made on other events.
Doing all this on a normal RDBMS would just be queries with some big joins across several tables. On GAE it would obviously need to work differently. Here are my initial thoughts on the design of the entities:
Event entity - id, name, timestamp, list property of tags, view count, creator's username, creator's profile image id, number of journal entries it contains, number of total comments it contains, timestamp of last update to contained journal entries, list property of index words for search (built/updated from text from contained journal entries)
JournalEntry entity - timestamp, journal text, name of event, creator's username, creator's profile image id, list property of comments (containing commenter username and image id)
User entity - username, password hash, email, list property of subscribed events, timestamp of create date, image id, number of comments posted, number of events created, number of journal entries created, timestamp of last journal activity
UserComment entity - username, id of event commented on, title of event commented on
TagData entity - tag name, count of events with tag on them
So, I'd like to hear what people here think about the design and what changes should be made to help it scale well. Thanks!
Rather than store Event.id as a property, use the id automatically embedded in each entity's key, or set unique key names on entities as you create them.
You have lots of options for modeling the relationship between Event and JournalEntry: you could use a ReferenceProperty, you could parent JournalEntries to Events and retrieve them with ancestor queries, or you could store a list of JournalEntry key ids or names on Event and retrieve them in batch with a key query. Try some things out with realistically-distributed dummy data, and use appstats to see what works best.
UserComment references an Event, while JournalEntry references a list of UserComments, which is a little confusing. Is there a relationship between UserComment and JournalEntry? or just between UserComment and Event?
Persisting so many counts is expensive. When I post a comment, you're going to write a new UserComment entity and also update my User entity and a JournalEntry entity and an Event entity. The number of UserComments you expect per Event makes it unwise to include everything in the same entity group, which means you can't do these writes transactionally, so you'll do them serially, and the entities might be stored across different network nodes, making the whole operation slow; and you'll also be open to consistency problems. Can you do without some of these counts and consider storing others in memcache?
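If you move a hot count to memcache, the write becomes a cheap atomic increment (a sketch; the key name is illustrative, and memcache can evict values, so treat the count as approximate or periodically persist it):

from google.appengine.api import memcache

# atomically bump the comment count for an event
memcache.incr("comment_count:%s" % event_id, initial_value=0)

# read it back (0 if evicted or never set)
count = memcache.get("comment_count:%s" % event_id) or 0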
When you fetch an Event from the datastore, you don't actually care about its list of search index words, and retrieving and deserializing them from protocol buffers has a cost. You can get around this by splitting each Event's search index words into a separate child EventIndex entity. Then you can query EventIndex on your search term, fetch just the EventIndex keys for EventIndexes that match your search, derive the corresponding Events' keys with key.parent(), and fetch the Events by key, never paying for the retrieval or deserialization of your search index word lists. Brett Slatkin explains this strategy here at 14:35.
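In NDB syntax, the child-index pattern looks roughly like this (a sketch; names are illustrative):

class EventIndex(ndb.Model):
    # created as a child entity of its Event; holds only the search words
    index_words = ndb.StringProperty(repeated=True)

# keys-only query: never retrieves or deserializes the word lists
keys = EventIndex.query(EventIndex.index_words == term).fetch(20, keys_only=True)
events = ndb.get_multi([key.parent() for key in keys])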
Updating Event.viewCount will fail if you have a lot of views for any Event in rapid succession, so you should try out counter sharding.
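For reference, the usual shape of a sharded counter (a sketch; the shard count and model name are illustrative):

import random

NUM_SHARDS = 20

class ViewCountShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment_views(event_id):
    # pick a random shard so concurrent writers rarely contend
    shard_id = "%s-%d" % (event_id, random.randint(0, NUM_SHARDS - 1))
    shard = ViewCountShard.get_by_id(shard_id) or ViewCountShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_views(event_id):
    keys = [ndb.Key(ViewCountShard, "%s-%d" % (event_id, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)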
Good luck, and tell us what you learn by trying stuff out.