How to get "relationships" elements efficiently on datastore? - google-app-engine

I'm developing a social login system using Google Datastore. I need to authenticate the user by a social identity and then return the information for all of their identities. The client can log in with multiple social accounts and also with an identity created on my site, so a user effectively has multiple social identities plus my site identity. Currently I'm running 3 queries (sequentially), which feels like a bit too much, so I'm wondering if there is a better way to do this:
// get the userID registered with my site (if registered)
- userID = SELECT userID FROM Social WHERE socialID == $socialID
// get the data of the user
- userData = SELECT * FROM MyData WHERE userID == $userID
// get the data of any other identity linked to the user id
- otherSocial = SELECT * FROM Social WHERE userID == $userID AND socialID != $socialID

You can get userData by its key, which is faster and cheaper than running a query. To be able to do that, use the userID as the id of the userData entity.
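A minimal sketch of that approach, assuming ndb models named Social and MyData after the pseudo-queries above (kind and property names are illustrative):

from google.appengine.ext import ndb

class Social(ndb.Model):   # key name: the social ID
    userID = ndb.StringProperty()

class MyData(ndb.Model):   # key name: the user ID
    name = ndb.StringProperty()

def get_user_data(social_id):
    # Two key lookups instead of two queries: strongly consistent and cheaper.
    social = Social.get_by_id(social_id)
    if social is None:
        return None
    return MyData.get_by_id(social.userID)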
Your third query is probably needed only in rare circumstances, e.g. when a user accesses account settings. In any case, I would not worry too much about these queries: they retrieve a small number of entities, which means they execute very fast.
You can store some data in a session, so you don't have to retrieve it again until the next session. I store a LoginOption entity, which is the equivalent of your userID and socialID. Thus, I can bypass the first query until the user logs out or the session expires.

Related

Appengine ndb - How to ensure unique username and email without ancestors?

In my Appengine (using ndb) application I store users and both username and email need to be unique.
I also need to be able to update progress (save level if higher than previously stored level), change email and pw and delete account.
I noticed that it is not possible to query without ancestors in a transaction. But creating an ancestor is NOT a solution since that would limit the number of writes to 1 per second which is not OK if the app gets popular. So I need another solution.
Is it possible to use the Key? Yes, but that only makes the username unique; how can I make sure no one reuses the email for another account?
You should be able to use a cross group transaction for this along with an entity that exists solely for reserving email addresses.
For your User entity, you could use the username as the key name. When creating a user, you also create an EmailReservation entity that has the user's email address as a key name.
You then use a cross-group transaction to create a new user:
from google.appengine.ext import ndb

class User(ndb.Model):              # key name: the username
    pass

class EmailReservation(ndb.Model):  # key name: the reserved email address
    pass

@ndb.transactional(xg=True)
def create_user(user_name, email):
    user = User.get_by_id(user_name)
    email_reservation = EmailReservation.get_by_id(email)
    if user or email_reservation:
        # Either the user_name or the email is already in use, so stop
        return None
    # Create the user and reserve the email address so others can't use it
    user = User(id=user_name)
    email_reservation = EmailReservation(id=email)
    ndb.put_multi([user, email_reservation])
    return user
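For example (hypothetical values):

user = create_user('alice', 'alice@example.com')
if user is None:
    pass  # the username or the email is already taken; report it to the client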

Sharded ancestor entities in GAE

I'm working on a GAE-based project involving a large user base (possibly millions of users). We use Datastore for persistency. Users will be identified both by username and by e-mail address, so these two properties should be unique across all entities of the kind. Because Datastore doesn't support unique fields other than ID, we need transactions to ensure uniqueness of these fields when new users are registered. And in order to have transactions, User entities need to be enclosed in entity groups.
Having large entity groups is not recommended, as pointed out here. Therefore, given a possible large number of stored users, I'm thinking of putting them into multiple smaller entity groups. Each group would have a common parent with ID generated from the two unique fields (a piece of the MD5 sum for instance). Inserting a new user could look like this (in Python):
from google.appengine.ext import ndb

@ndb.transactional
def register_new_user(login, email, full_name):
    # validation code omitted
    group_id = a_simple_hash(login, email)
    group_key = ndb.Key('UserGroup', group_id)
    # The user must be created under the group parent, or the ancestor
    # query below would never see it.
    user = User(parent=group_key, login=login, email=email, full_name=full_name)
    query = User.query(ancestor=group_key).filter(
        ndb.OR(User.login == login, User.email == email))
    if not query.get():
        user.put()
One problem I see with this solution is that it will be impossible to get a User by ID alone. We'd have to use complete entity keys.
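For example, a lookup would have to rebuild the full key path first; something like this (sketch, where user_id is hypothetical):

# User.get_by_id(user_id) alone no longer works; the parent key must be
# recomputed from the same fields that were used at registration time.
group_id = a_simple_hash(login, email)
user = ndb.Key('UserGroup', group_id, 'User', user_id).get()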
Are there any other cons to such an approach? Has anyone tried something similar?
EDIT
As has been pointed out in the comments, a hash like the one outlined above would not work properly: it only catches a duplicate when both the username and the e-mail match an existing user, since any other combination hashes to a different group. It would work if the hash were computed from a single field.
Nevertheless, I find the concept of such sharding interesting by itself and perhaps worth discussing.
An e-mail address is owned by a single user and is unique, so there is a very small chance that somebody will (try to) use the same e-mail address.
My approach would be: get_or_insert a new login, which makes it easy to log in (by key), and then verify that the e-mail address is unique.
If it is not unique you can discard it or do something else.
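A minimal sketch of that get_or_insert approach, assuming an ndb User model keyed by the login (model and fields are illustrative):

from google.appengine.ext import ndb

class User(ndb.Model):   # key name: the login
    email = ndb.StringProperty()

def register(login, email):
    # get_or_insert is transactional on the key: two racing registrations
    # of the same login cannot both create a fresh entity.
    user = User.get_or_insert(login, email=email)
    if user.email != email:
        return None      # login already taken with a different e-mail
    # Separate (eventually consistent) check that the e-mail is not reused.
    other = User.query(User.email == email).get()
    if other is not None and other.key != user.key:
        return None      # e-mail already in use by another account
    return user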
Entity groups have meaning for transactions. I'm interested in your planned transactions, because I do not understand your entity-group key hash. Which entities will be part of the entity group, and why?
A user with the same login will be part of another entity group, if I understand your hash correctly?
It looks like your entity group holds a single entity.
In my opinion you're overthinking here: what's the probability of two users registering the same username at the same time?
Very slim. Eventual consistency is good enough for this case, as you don't need nanosecond precision...
unless you plan to have more users than Facebook, with people registering every second.
Registering with the same email is virtually impossible for different users, since the check has already been done by the email provider for you!
Only a single user could try to open two accounts with the same email address. Eventual consistency is good enough for this query too.
Your user entities each belong to their own entity group.
Actually, in most use cases your User is the most obvious root entity: people use the datastore because they need scalability, and most of the time huge scale is needed for user-oriented apps.

How to prevent a user from accessing other users' data?

PROBLEM
User authenticated into the application
Simple database schema: User ---> Document ---> Item
API to access to Document Items
If the logged-in user knows the id of an item that belongs to another user, they can access it.
I would like to prevent this behavior.
SOLUTION
The first solution I found is to add a userid field to every record in every table and check, on every query, whether the record belongs to the logged-in user.
Is this a good solution? Do you know a better design pattern to prevent users from accessing other users' data?
Thanks
If the documents belong to a user, adjust your queries so that only items that belong to the user's documents are retrieved. No need to add userIDs to the items themselves.
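For example, a sketch with a parameterized SQL query, assuming a DB-API cursor and table/column names taken from the schema above:

# The item comes back only if its document belongs to the logged-in user.
cursor.execute(
    "SELECT i.* FROM item i "
    "JOIN document d ON d.id = i.document_id "
    "WHERE i.id = %s AND d.user_id = %s",
    (item_id, logged_in_user_id))
item = cursor.fetchone()   # None: no such item, or not this user's item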
If you need to expose IDs to the users, make those IDs GUIDs instead of consecutive numbers. While not a perfect solution, it makes it much harder to guess the IDs of other users' items.
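For instance, in Python:

import uuid
item_id = uuid.uuid4().hex   # 32 random hex chars; not guessable like 1, 2, 3, ...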
If you're using Oracle, there's VPD, Virtual Private Database. You can use that to restrict access for users.

Consistency in a Login Model using the Google Cloud Datastore

I'm trying to get my head around a login model that uses several authentication methods.
1.) For example, when a new user tries to log in with OpenID my backend is going to insert two entities into the datastore:
Insert a new user, where the automatically inserted id will be his $userId
(kind: User, id: autoId)
Insert a new login that is linked to the $userId
(kind: AuthOpenid, name: $openId), Property(userId: $userId)
This will allow me to make lookup by key requests when a user tries to log in, which enforces strongly consistent data, right?
The idea is that one user can have many logins (like Stack Exchange) and I don't have to worry about write/read limits, because no entities have ancestors, while still enforcing consistency.
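A sketch of those two inserts with ndb, using the kinds named above (property types are assumptions):

from google.appengine.ext import ndb

class User(ndb.Model):        # id: auto-allocated by the datastore
    pass

class AuthOpenid(ndb.Model):  # key name: the OpenID identifier
    userId = ndb.IntegerProperty()

def register_openid(openid):
    user = User()
    user.put()                                    # allocates $userId
    AuthOpenid(id=openid, userId=user.key.id()).put()
    return user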
2.) On a related note: Assuming my users are allowed to pick a username once they have provided an authentication method, how do I efficiently check if a username is taken?
My idea was to insert a new entity for every picked username.
Insert a new username
(kind: Username, name: $username)
Now I can simply make a lookup-by-key request to see if a username is taken. As far as I know, common lookups will be stored in memcache anyway, so this should be efficient, right?
I could also reverse the procedure and just attempt to insert a username and see if it fails.
1) Your approach looks good. As you've noted, Lookup operations (lookup by key) are guaranteed to return consistent results.
You're also correct that by putting each AuthOpenid entity in its own entity group (no common ancestor), you will avoid the write throughput limit of 1 write/second on any particular entity group (there's no corresponding limit on rate of entity group creation).
2) This will also work, but you will need to execute the read and write operations as part of a transaction. This ensures that if two users try to reserve the same username, only one of them will succeed.
In Cloud Datastore, an insert mutation will fail if an entity with the same key already exists, so this will also work.
(Note that this differs from the put() operation in the App Engine Datastore, which uses upsert semantics.)
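A sketch of that transactional check-and-insert for usernames, following the kinds named in the question:

from google.appengine.ext import ndb

class Username(ndb.Model):   # key name: the chosen username
    userId = ndb.IntegerProperty()

@ndb.transactional
def reserve_username(username, user_id):
    # Read and write happen in one transaction: if two users race for the
    # same name, at most one commit succeeds.
    if Username.get_by_id(username) is not None:
        return False
    Username(id=username, userId=user_id).put()
    return True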

Social Network Activity Feed + Groups

I'm working on a social network app and want to create an activity feed so people can keep up to date with all of their connections (classic Facebook stream). I have a DB table called activity set up for this like so:
activity_id (int)
user_id (int) //who posted it
group_id (int) //the group of connections that have permission to view
type (enum) //the type of activity performed
time (datetime) //the time the activity was performed
I would then do a SELECT * FROM activity WHERE user_id IN (connections) to get the latest news.
Here's the catch: users' activities are not always visible to the complete set of connections. Users can create groups of user ids to form smaller sets within their superset of connections. It's like how Facebook allows you to specify who sees a particular post instead of allowing all friends to see it.
I have a separate groups table setup with the following schema:
group_id (int)
connection_id (user_id, int)
user_id (int) //the group creator
I have a group_id in my activity table. The group_id is the link to the subset of connections that have permission to see the post.
My question is: what is the best way to do this type of feed, and is there an optimal single select statement that will get me the desired output (a list of my connections' activities that I have been granted permission to see)?
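For concreteness, I imagine something along these lines (untested sketch; it assumes a connections table, and that a NULL group_id means the post is visible to all connections):

cursor.execute(
    "SELECT a.* FROM activity a "
    "WHERE a.user_id IN (SELECT c.connection_id FROM connections c "
    "                    WHERE c.user_id = %(me)s) "
    "  AND (a.group_id IS NULL "
    "       OR EXISTS (SELECT 1 FROM groups g "
    "                  WHERE g.group_id = a.group_id "
    "                    AND g.connection_id = %(me)s)) "
    "ORDER BY a.time DESC",
    {'me': my_user_id})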
Thanks.
If you're open to offloading your activity stream functionality to a service via an API, Collabinate (http://www.collabinate.com) may be useful to you.
