Sharded ancestor entities in GAE - google-app-engine

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistency. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
#ndb.transactional
def register_new_user(login, email, full_name) :
# validation code omitted
user = User(login = login, email = email, full_name = full_name)
group_id = a_simple_hash(login, email)
group_key = ndb.Key('UserGroup', group_id)
query = User.query(ancestor = group_key).filter(ndb.OR(User.login = login, User.email = email))
if not query.get() :
user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
Are there any other cons of such approach? Anyone tried something similar?
EDIT
As I've been pointed out in comments, a hash like the one outlined above would not work properly because it would only prevent registering users having non-unique e-mails together with non-unique usernames matching those e-mails. It would work if the hash was computed based on a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worth of discussion.

An e-mail address is owned by a user and unique. So there is a very small change, somebody will (try to) use the same email address.
So my approch would be: get_or_insert a new login, which makes it easy to login (by key) and next verify if the e-mail address is unique.
If it not unique you can discard or .....do something else
Entity groups have meaning for transactions. I'am interested in your planned transactions, because I do not understand your entity group key hash. Which entities will be part of the entity group, and why?
A user with the same login will be part of another entity group, If i do understand your hash?
It looks like your entity group holds a single entity.

In my opinion you're overthinking here : what's the probability of having two users register with the same username at the same time ?
Very slim. Eventual consistency is good enough for this case, as you don't nanosecond precision...
unless you plan to have more users than facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually in most use cases, your User is the most obvious root entity : people use the datastore because they need scalability, and most of the time huge scale is needed for user oriented apps.

Related

Separating collections in mongodb database that share a 1-to-1 relationship

I am using mongodb as the database for a project I've been working on and in the database I have a "user" collection and an "account" collection. Every user has one account and every account has a "user" field that is the _id of the corresponding user. The reason I separated these into two collections is because I thought it made sense to keep the user's sensitive data (password, email, legal name, etc.) separate from the account data (things like interests, followers, username, etc.). Also the account collection has a lot of fields so it just seemed easier to not over-saturate the "user" collection with data.
So, my question is - Now that I essentially have 2 collections pointing to the same user, should I use the "user._id" to query both users and accounts? Since each account has a unique "user" field, is there a reason to query those accounts with their own _id property? It seems odd to keep track of two different _id's on the frontend and conditionally send either the user._id or account._id.
The two main drawbacks I have found when using the user._id to query both users and accounts is:
When querying account data, I have to almost always make sure I send the "user" field so I have that id on the front end.
If in the future, I wanted to add the ability for users to create multiple accounts, I would have to change the code to now fetch account data using the "account._id".
Hopefully that all makes sense, and maybe it doesn't even make sense for me to separate those collections. Thank you to anyone who can help!

Achieving Strong Consistency Using get_or_insert

I have a model like this:
class UserModel(ndb.Model):
''' model class which stores all the user information '''
fname = ndb.StringProperty(required=True)
lname = ndb.StringProperty(required=True)
sex = ndb.StringProperty(required=True, choices=['male', 'female'])
age = ndb.IntegerProperty(required=True)
dob = ndb.DateTimeProperty(required=True)
email = ndb.StringProperty(default=None)
mobile = ndb.StringProperty(required=True)
city = ndb.StringProperty(required=True)
state = ndb.StringProperty(required=True)
Since none of above fields are unique, not even email becuase many people may no have email ids. So I am using the following logic to create a string id
1. Take first two letters of 'state' and change it to upper case.
2. Take first to letters of 'city' and change it to upper case.
3. Get the count of all records in the database and increment by one.
4. Append all of them together.
I am using get_or_insert for inserting the entity.
Though adding a user, will not happen too often but any kind of clash would be catastrophic, means probability of contention is less but its impact is very high.
My questions are:
1. Will using get_or_insert guarantee that I will never have duplicate IDs?
2. get_or_insert documentation says "Transactionally retrieves an existing
entity or creates a new one.". How can something perform an operation
"transactionally" without using a ancestor query.
PS: For several reasons I can't keep all the user entities in the same entity groups.
In order to provide transactionality, get_or_insert uses a Datastore transaction. In order to use a query in a transaction it must be an ancestor query, however transactions can also get and put, which don't require a parent to be set on the entity.
However, as #Greg mentioned, you absolutely do not want to use this scheme for generating user ids. In particular, doing a count on your db is incredibly slow and will not scale, and is eventually consistent. Because the query is eventually consistent, it may return a count smaller than the actual count as long as results are eventually consistent (which for a large app will be all the time). This means you could wait several hours before an insert would actually succeed.
If you want to provide a customer ID with a State and City, I would recommend doing the following:
Do a put using automatic ids.
Expose to the user a "Customer ID" which is the State + City + ID.
When you want to lookup a customer given their "Customer ID", just do a get for the ID portion.
if you keep that ID scheme (for which you honestly don't really need steps 1 and 2, just 3), there is no reason for it to create duplicate IDs. With get_or_insert, it'll look for the exact ID you provide and fetch it if it exists, or simply create it if it doesn't, as explained here. So you CANNOT have duplicate IDs (well if you have this ID as your forced key in your model). if you follow the link provided it clearly states that :
The get and subsequent (possible) put operations are wrapped in a transaction to ensure atomicity. Ths means that get_or_insert() will never overwrite an existing entity, and will insert a new entity if and only if no entity with the given kind and name exists.
And the fact it does it transactionnaly means it'll lock up the entity group to be sure you don't have contention. Since you don't seem to have ancestors I think it'll just lock the entity you're updating

Consistency in a Login Model using the Google Cloud Datastore

I'm trying to get my head around a login model that uses several authentication methods.
1.) For example, when a new user tries to log in with OpenID my backend is going to insert two entities into the datastore:
Insert a new user, where the automatically inserted id will be his $userId
(kind: User, id: autoId)
Insert a new login that is linked to the $userId
(kind: AuthOpenid, name: $openId), Property(userId: $userId)
This will allow me to make lookup by key requests when a user tries to log in, which enforces strongly consistent data, right?
The idea is that one user can have many logins (like stackexchange) and I don't have to worry about write/read limits because no entities have ancestors while still enforcing consistency.
2.) On a related note: Assuming my users are allowed to pick a username once they have provided an authentication method, how do I efficiently check if a username is taken?
My idea was to insert a new entity for every picked username.
Insert a new username
(kind: Username, name: $username)
Now I can simply make a lookup by key request to see if a username is taken. As far as I know, common lookups will be stored in memcache anyways, so this should be efficient, right?
I could also reverse the procedure and just attempt to insert a username and see if it fails.
1) Your approach looks good. As you've noted, Lookup operations (lookup by key) are guaranteed to return consistent results.
You're also correct that by putting each AuthOpenid entity in its own entity group (no common ancestor), you will avoid the write throughput limit of 1 write/second on any particular entity group (there's no corresponding limit on rate of entity group creation).
2) This will also work, but you will need to execute the read and write operations as part of a transaction. This ensures that if two users try to reserve the same username, only one of them will succeed.
In Cloud Datastore, an insert mutation will fail if an entity with the same key already exists, so this will also work.
(Note that this is different from the put() operation in the App Engine Datastore which uses upsert semantics.)

Simple Database - Design issues

Just a homework question I am trying to figure out, I would appreciate some assistance.
Apparently, there are three problems with the design of this database design:
Account = {AccNumber, Type, Balance}
Customer = {CustID, FirstName, LastName, Address, AccNumber}
The one that is pretty obvious is that 'CustID' is useless if 'AccNumber' exists.
I am not quite sure about the second and third problem.
Is there a problem with a separate attribute for 'FirstName' and "LastName', cant we just use 'Name'?
And another option, if 'AccNumber' is the primary key (assuming CustID will be removed), it probably should be place in the beginning :
Such as:
Customer = {AccNumber, Name, Address}
Any input would be appreciated!
Thanks
The customer-account relationship, at first glance, appears to be a many-many relationship, which necessitates the use of an intermediary relationship table. For instance, I have three accounts of my own at my bank. In addition, my wife has two of her own. Finally, we have a shared account. The schema above could not well handle such relationships.
You could, indeed, just use "Name" - but you may need to know what the first or last names are at some point in the future and such a concatination can be quite problematic to split.
Good luck with your homework...
The problem is that you haven't presented us with what the database should represent in words; as it is now, there's nothing "wrong" with the design, since we don't know what the design is supposed to model.
I certainly wouldn't say that CustID is useless, as it serves as the primary key of the table. What you need to determine is the relationship between customers and accounts. It should be one of the following:
A single customer can be tied to multiple accounts, but a single account can be tied to a single customer
A single customer can be tied to only one account, but an account can be tied to multiple customers.
A single customer can be tied to multiple accounts, and a single account can be tied to multiple customers
Right now, with AccNumber in the Customer table, your design models #2.
How is is designed right now, each customer could only have one bank account.
The many-to-many relationship will be a problem. Instead, you might create a third table that holds the relationships. For example:
Account = {AccNumber, Type, Balance}
Connection = {ConnID, AccNumber, CustID}
Customer = {CustID, FirstName, LastName, Address}
This way, both Account and Customer are parented by Connection (for lack of a better name). You could query all connections with a certain AccNumber and find all the customers using that account, and vice versa.

Modeling Votes on GAE

I'm trying to determine the most efficient way to create a votable entity on GAE's datastore. I would like to show the user a control to vote for this entity or an icon indicating that they have already voted for it; ie, I'm asking "has a user voted on this entity?" Lets say that we have a Question entity that a user may up-vote. Here is what I'm thinking of doing:
Query for my Question entities. These questions already have a precalculated ranking on which I will sort.
Use a relation index entity that is a child of the Question entity. Query for all Questions using the same filters as #1 where my user is a member of this relation index entity.
Merge the results of #2 into #1 by setting a hasVoted property to true for each found set member.
This is the cleanest way I could think of doing it but it still requires two queries. I didn't to create duplicate Question entities for each user to own because it would cause too much data duplication. Is this solution a good way to handle what is effectively a join between a m2m relationship between Votes and Questions or am I thinking too relationally?
Instead of using a relation index, just have a child entity for each user that's voted on the question. Make the key_name of the child entity the ID of the user. Then, to determine if a user y has voted on a question with ID x, simply fetch the key (Question:x/Vote:y). You can batch this to fetch multiple entities for multiple questions or users, too.
I would take a look to Overheard Google App Engine sample application.
Our basic model will be to have
Quotes, which contain a string for the
quotation, and Votes, which contain
the user name and vote for a
particular user.
There's a Google article about it and here you can find the sources.
To avoid the second query, you could store all of the questions a user has voted on in a single entity. This entity could be part of the User model, or exist in a one-to-one relationship with User entities.
You could then load this information as needed (and store it to memcache to avoid datastore loads) so that you can quickly check if a user has already voted on a question (without doing a second query most of the time).
If a user may vote on a really large number of questions, then you may have to extend this idea. Here is an outline (not functionally complete) of how you might go about the simple scheme:
class UserVotes(db.Model):
# key = key_name or ID of the corresponding user entity
# if all of your question entities have IDs, then voted_on can be a list of
# integers; otherwise it can be a list of strings (key_name values)
voted_on = db.ListProperty(int, indexed=False)
# in your request handler ...
questions = ...
voted_on = memcache.get('voted-on:%s' % user_id)
if voted_on is None:
voted_on = UserVotes.get_by_id(user_id) # should do get_or_insert() instead
memcache.set(...)
for q in questions:
q.has_voted = q.key().id() in voted_on

Resources