Modeling Votes on GAE - google-app-engine

Modeling Votes on GAE - google-app-engine

I'm trying to determine the most efficient way to create a votable entity on GAE's datastore. I would like to show the user a control to vote for this entity or an icon indicating that they have already voted for it; ie, I'm asking "has a user voted on this entity?" Lets say that we have a Question entity that a user may up-vote. Here is what I'm thinking of doing:
Query for my Question entities. These questions already have a precalculated ranking on which I will sort.
Use a relation index entity that is a child of the Question entity. Query for all Questions using the same filters as #1 where my user is a member of this relation index entity.
Merge the results of #2 into #1 by setting a hasVoted property to true for each found set member.
This is the cleanest way I could think of doing it but it still requires two queries. I didn't to create duplicate Question entities for each user to own because it would cause too much data duplication. Is this solution a good way to handle what is effectively a join between a m2m relationship between Votes and Questions or am I thinking too relationally?

Instead of using a relation index, just have a child entity for each user that's voted on the question. Make the key_name of the child entity the ID of the user. Then, to determine if a user y has voted on a question with ID x, simply fetch the key (Question:x/Vote:y). You can batch this to fetch multiple entities for multiple questions or users, too.

I would take a look to Overheard Google App Engine sample application.
Our basic model will be to have
Quotes, which contain a string for the
quotation, and Votes, which contain
the user name and vote for a
particular user.
There's a Google article about it and here you can find the sources.

To avoid the second query, you could store all of the questions a user has voted on in a single entity. This entity could be part of the User model, or exist in a one-to-one relationship with User entities.
You could then load this information as needed (and store it to memcache to avoid datastore loads) so that you can quickly check if a user has already voted on a question (without doing a second query most of the time).
If a user may vote on a really large number of questions, then you may have to extend this idea. Here is an outline (not functionally complete) of how you might go about the simple scheme:
class UserVotes(db.Model):
# key = key_name or ID of the corresponding user entity
# if all of your question entities have IDs, then voted_on can be a list of
# integers; otherwise it can be a list of strings (key_name values)
voted_on = db.ListProperty(int, indexed=False)
# in your request handler ...
questions = ...
voted_on = memcache.get('voted-on:%s' % user_id)
if voted_on is None:
voted_on = UserVotes.get_by_id(user_id) # should do get_or_insert() instead
memcache.set(...)
for q in questions:
q.has_voted = q.key().id() in voted_on

Related

NoSql - entity holds an owner ID field vs owner holds list of child ID's

I am currently exploring MongoDB.
I built a notes web app and for now the DB has 2 collections: notes and users.
The user can create, read and update his notes.
I want to create a page called /my-notes that will display all the notes that belong to the connected user.
My question is:
Should the notes model has an ownerId field or the opposite - the user model will have a field of noteIds of type list.
Points I found relevant for the decision making:
noteIds approach:
There is no need to query the notes that hold the desired ownerId (say we have a lot of notes then we will need indexes and search accross the whole notes collection). We just need to find the user by user ID and then get all the notes by their IDs.
In this case there are 2 calls to DB.
The data is ordered by the order of insertion to the notesIds field in the document.
ownerId approach:
We do need to find the notes by their ownerId field across the notes collection which might be more computer "intensive".
We can paginate / sort the data as we want - more control over the data.
Are there any more points you can think of?
As I can conclude this is a question of whether you want less computer intensive DB calls vs more control over the data.
What are the "best practices"?
Thanks,

A similar use case is explained in the documentation. If there is no limit on number of notes a user can have, it might be better to store a userId reference field in notes document.
As you've figured out already, pagination would be easier in the second approach. Also when updating notes, you can simply updateOne({ _id: "note_id", userId: 1 }) instead of checking user's document if the note actually belong to the user.

Sharded ancestor entities in GAE

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistency. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
#ndb.transactional
def register_new_user(login, email, full_name) :
# validation code omitted
user = User(login = login, email = email, full_name = full_name)
group_id = a_simple_hash(login, email)
group_key = ndb.Key('UserGroup', group_id)
query = User.query(ancestor = group_key).filter(ndb.OR(User.login = login, User.email = email))
if not query.get() :
user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
Are there any other cons of such approach? Anyone tried something similar?
EDIT
As I've been pointed out in comments, a hash like the one outlined above would not work properly because it would only prevent registering users having non-unique e-mails together with non-unique usernames matching those e-mails. It would work if the hash was computed based on a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worth of discussion.

An e-mail address is owned by a user and unique. So there is a very small change, somebody will (try to) use the same email address.
So my approch would be: get_or_insert a new login, which makes it easy to login (by key) and next verify if the e-mail address is unique.
If it not unique you can discard or .....do something else
Entity groups have meaning for transactions. I'am interested in your planned transactions, because I do not understand your entity group key hash. Which entities will be part of the entity group, and why?
A user with the same login will be part of another entity group, If i do understand your hash?
It looks like your entity group holds a single entity.

In my opinion you're overthinking here : what's the probability of having two users register with the same username at the same time ?
Very slim. Eventual consistency is good enough for this case, as you don't nanosecond precision...
unless you plan to have more users than facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually in most use cases, your User is the most obvious root entity : people use the datastore because they need scalability, and most of the time huge scale is needed for user oriented apps.

NDB Modeling One-to-one with KeyProperty

I'm quite new to ndb but I've already understood that I need to rewire a certain area in my brain to create models. I'm trying to create a simple model - just for the sake of understanding how to design an ndb database - with a one-to-one relationship: for instance, a user and his info. After searching around a lot - found documentation but it was hard to find different examples - and experimenting a bit (modeling and querying in a couple of different ways), this is the solution I found:
from google.appengine.ext import ndb
class Monster(ndb.Model):
name = ndb.StringProperty()
#classmethod
def get_by_name(cls, name):
return cls.query(cls.name == name).get()
def get_info(self):
return Info.query(Info.monster == self.key).get()
class Info(ndb.Model):
monster = ndb.KeyProperty(kind='Monster')
address = ndb.StringProperty()
a = Monster(name = "Dracula")
a.put()
b = Info(monster = a.key, address = "Transilvania")
b.put()
print Monster.get_by_name("Dracula").get_info().address
NDB doesn't accept joins, so the "join" we want has to be emulated using class methods and properties. With the above system I can easily reach a property in the second database (Info) through a unique property in the first (in this case "name" - suppose there are no two monsters with the same name).
However, if I want to print a list with 100 monster names and respective addresses, the second database (Info) will be hit 100 times.
Question: is there a better way to model this to increase performance?

If its truly a one to one relationship, why are creating 2 models. Given your example the Address entity cannot be shared with any Monster so why not put the Address details in the monster.
There are some reasons why you wouldn't.
Address could become large and therefore less efficient to retrieve 100's of properties when you only need a couple - though project queries may help there.
You change your mind and you want to see all monsters that live in Transylvania - in which case you would create the address entity and the Monster would have the key property that points to the Address. This obviously fails when you work out that some monsters can live in multiple places (Werewolfs - London, Transylvania, New York ;-) , in which case you either have a repeating KeyProperty in the monstor or an intermediate entity that points to the monster and the address. In your case I don't think that monsters on the whole have that many documented Addresses ;-)
Also if you are uniquely identifying monsters by name you should consider storing the name as part of the key. Doing a Monster.get_by_id("dracula") is quicker than a query by name.
As I wrote (poorly) in the comment. If 1. above holds and it is a true one to one relationship. I would then create Address as a child entity (Monster is the parent/ancestor in the key) when creating address. This allows you to,
allow other entities to point to the Address,
If you create a bunch of child entities, fetch them with a single
ancestor query). 3 If you have get monster and it's owned entities
again it's an ancestor query.
If you have a bunch of entities that
should only exist if Monster instance exists and they are not
children, then you have to do querys on all the entity types with
KeyProperty's matching the key, and if theses entities are not
PolyModels, then you have to perform a query for each entity
type (and know you need to perform the query on a given entity,
which involves a registry of some type, or hard coding things)

I suspect what you may be trying could be achieved by using elements described in the link below
Have a look at "Operations on Multiple Keys or Entities" "Expando Models" "Model Hooks"
https://developers.google.com/appengine/docs/python/ndb/entities
(This is probably more a comment than an answer)

App Engine Datastore: entity design and query optimization

I have a system where users can vote on entities, if they like or hate them. It will be bazillion votes and trazillion records, hopefully, some time in the future :)
At the moment i store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when i want to get every Record the user liked i do a query like this:
I query the "UserRecordVote" table for all the "likes", then i take the recordIds from that resultset, create a key of that property and get the record from the Record Table.
Then i aggregate all that in a list and return it.
Here's the question:
I came up with a different approach and i want to find out if that one is 1. faster and 2. how much is the difference in cost.
I would create an Entity which's name would be userId + "likes" and the key would be the record id:
new Entity(userId + "likes", recordId)
So when i would do a query to get all the likes i could simply query for all, no filters needed. AND i could just grab the entity key! which would be much cheaper if i remember the documentation of app engine right. (can't find the pricing page anymore). Then i could take the Iterable of keys and do a single get(Iterable keys). Ok so i guess this approach is faster and cheaper right? But what if i want to grab all the votes of a user or better said, i want to grab all the records a user didn't vote on yet.
Here's the real question:
I wan't to load all the records a user didn't vote on yet:
So i would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then i would remove all the record entity keys matching one of the vote entity keys and with the result i would get(Iterable keys) the full entities and have all the record entites which are not in one of the two voting tables.
Is that a useful approach? Is that the fastest and cost efficient way to do a datastore query? Am i totally wrong and i should store the information as list properties?
EDIT:
With that approach i would have 2 entity groups for each user, which would result in million different entity groups, how would GAE Datastore handle that? Atleast the Datastore Viewer entity select box would probably crash :) ?

To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is you expect this to be huge, you likely want to keep a running counter of your votes rather than tabulating every time you pull up a UserRecord - querying all the votes, and then calculating them on each view is very slow - especially since App Engine will only return 1000 results on each query, and if you have more than 1000 votes, you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There's examples of that with code available if you do a google search.

Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)
hates = len(rec.hates.split('_'))
etc.

Google Appengine: Is This a Good set of Entity Groups?

I am trying to wrap my head around Entity Groups in Google AppEngine. I understand them in general, but since it sounds like you can not change the relationships once the object is created AND I have a big data migration to do, I want to try to get it right the first time.
I am making an Art site where members can sign up as regular a regular Member or as one of a handful of non-polymorphic Entity "types" (Artist, Venue, Organization, ArtistRepresentative, etc). Artists, for example can have Artwork, which can in turn have other Relationships (Gallery, Media, etc). All these things are connected via References and I understand that you don't need Entity Groups to merely do References. However, some of the References NEED to exist, which is why I am looking at Entity Groups.
From the docs:
"A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller."
That said, I have a couple hopefully yes/no questions.
Question 0: I gather you don't need Entity Groups just to do transactions. However, since Entity Groups are stored in the same region of Big Table, this helps cut down on consistency issues and race conditions. Is this a fair look at Entity Groups and Transactions together?
Question 1: When a child Entity is saved, do any parent objects get implicitly accessed/saved? i.e. If I set up an Entity Group with path Member/Artist/Artwork, if I save an Artwork object, do the Member and Artist objects get updated/accessed? I would think not, but I am just making sure.
Question 2: If the answer to Question 1 is yes, does the accessing/updating only travel up the path and not affect other children. i.e. If I update Artwork, no other Artwork child of Member is updated.
Question 3: Assuming it is very important that the Member and its associated account type entity exist when a user signs up and that only the user will be updating its Member and associated account type Entity, does it make sense to put these in Entity Groups together?
i.e. Member/Artist, Member/Organization, Member/Venue.
Similarly, assuming only the user will be able to update the Artwork entities, does it make sense to include those as well? Note: Media/Gallery/etc which are references to Artwork may be related to lots of Artwork, not just those owned by the user (i.e. many to many relations).
It makes sense to have all the user's bits in an entity group if it works the way I suspect (i.e. Q1/Q2 are "no"), since they will all be in the same region of BigTable. However, adding the Artwork to the entity group seems like it might violate the "keep it small" principal and honestly, may not need to be in Transactions aside from saving bandwidth/retrys when users are uploading artwork images.
Any thoughts? Am I approaching Entity Groups wrong?

0: You do need entity groups for transactions among multiple entities
1: Modifying/accessing children does not modify/access a parent
2: N/A
3: Sounds reasonable. My feeling is, entity groups should not be used unless you need transactions among them.
It is not necessary to have the the Artwork as a child for permission purposes. But if you need transactional modification to them (including e.g. creation and deletion) it might be better. For example: if you delete an account, you delete the user entity but before you delete the child, you get DeadlineExceeded or the server crashes. Now you have an orphaned Artwork. If you have more than 1,000 Artworks for an Artist, you must delete in batches.
Good luck!