Efficient group membership test for ACLs on AppEngine - google-app-engine

I'm creating an access control list for objects in my datastore. Each ACL entry could have a list of all user ids allowed to access the corresponding entry. Then my query to get the list of entities a user can access would be pretty simple:
select * from ACL where accessors = {userId} and searchTerms >= {search}
The problem is that this can only support 2500 users before it hits the index entry limit, and of course it would be very expensive to put an ACL entry with a lot of users because many index entries would need to be changed.
So I thought about adding a list of GROUPs of users that are allowed to access an entity. That could drastically lower the number of index entries needed for each ACL entry, but querying gets longer because I have to query for every possible group that a user is in:
select * from ACL where accessors = {userId} and searchTerms >= {search}
for (GroupId id : theSetOfGroupsTheUserBelongsTo) {
    select * from ACL where accessingGroups = {id} and searchTerms >= {search}
}
mergeAllTheseResultsTogether()
which would take a long time, be much more difficult to page through, etc.
Can anyone recommend a way to fetch a list of entities from an ACL that doesn't limit the number of accessing users?
Edit for more detail:
I'm searching and sorting on a long set of academic topics in use at a school. Some of the topics are created by administrators and should be school-wide. Others are created by teachers and are probably only relevant to those teachers. I want to create a google-docs-list-like hierarchy of collections that treats each topic like a document. The searchTerms field would be a list of words in the topic name - there is not a lot of internal text to search. Each topic will be in at least one collection (the organization's "root" collection) and could be in as many as 10-20 other collections, all managed by different people. Ideally there'd be no upper limit to the number of collections a document might appear in. My struggle here is to produce a list of all of the entities a particular user has at least read access to - the analog in google docs would be the "All Items" view.

Assuming that your documents and group permissions change less often (or are less time-critical) than user queries, I suggest this (which is how I'm solving a similar problem):
In your ACL, include the fields
accessors <-- all userids that can access the document
numberOfAccessors <-- store the length of accessors whenever you change that field
searchTerms
The key_name for ACL would be something like "indexed_document_id||index_num"
index_num in the key allows you to potentially have multiple entities storing the list of users, in case there are more than 5000 (the datastore limit on items in a list) or however many you want to have in a list to reduce the cost of loading one up (though you won't need to do that often).
Don't forget that the document to be accessed should be the parent of the index entity. That way you can do a select __key__ query rather than a select * (this avoids having to deserialize the accessors and searchTerms fields). You can search and return the parent() of the entity without needing to access any of the fields. More on that and other GAE search design at this blog post. Sadly that blog post doesn't cover ACL indexes like ours.
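As a concrete illustration, here is a minimal sketch of the index entity in Python's db API (the document kind, the field types, and the make_acl helper are my assumptions, not part of the answer):

from google.appengine.ext import db

class ACL(db.Model):
    # parent: the indexed document entity; key_name: "document_id||index_num"
    accessors = db.StringListProperty()        # user ids that can access the document
    numberOfAccessors = db.IntegerProperty()   # kept equal to len(accessors) on every put
    searchTerms = db.StringListProperty()      # words from the topic name

def make_acl(doc_key, index_num, userids, terms):
    # assumes the document uses string key names
    return ACL(key_name='%s||%d' % (doc_key.name(), index_num),
               parent=doc_key,
               accessors=userids,
               numberOfAccessors=len(userids),
               searchTerms=terms)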
Disclaimer: I've now encountered a problem with this design, in that which documents a user has access to is controlled by whether they are following another user. That means that if they follow or unfollow, there could be a large number of existing documents the user needs to be added to or removed from. If this is the case for you, then you might be stuck in the same hole as me if you follow my technique. I currently plan to handle this by updating the indexes for old documents in the background, over time. Someone else answering this question might have a solution to it baked in - if not, I may post it as a separate question.
Analysis of operations on this data structure:
Add an indexed document:
For each group that has access to the document, create an entity which includes all users that can access it in the accessors field
If there are too many to fit in one field, make more entities and increment that index_num value (using sharded counters).
O(n*m) where n is the number of users and m is the number of search terms
Query an indexed document:
select __key__ from ACL where accessors = {userid} and searchTerms >= {search} (though I'm not sure why you do ">=" actually; in my queries it's always "=")
Get all the parent keys from these keys
Filter out duplicates
Get those parent documents
O(n+m) where n is the number of users and m is the number of search terms - this is pretty fast. It uses the zig-zag merge join of two indexes (one on accessors, one on searchTerms). This assumes that GAE index scans are linear; they might be logarithmic for "=" queries, but I'm not privy to the design of their indexes, nor have I done any tests to verify. Note also that you don't need to load any of the properties of the index entity.
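A sketch of that read path under the model above (keys-only GQL; equality on searchTerms, per my own usage):

def documents_for(userid, term):
    # keys-only scan: zig-zag merge join across the two single-property indexes
    keys = db.GqlQuery(
        "SELECT __key__ FROM ACL WHERE accessors = :1 AND searchTerms = :2",
        userid, term)
    # each ACL entity's parent is the document it indexes; a set removes duplicates
    parents = set(k.parent() for k in keys)
    return db.get(list(parents))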
Add access for a user to a particular document
Check if the user already has access: select __key__ from ACL where accessors = {userid} and parent = {key(document)}
If not, add it: select * from ACL where parent = {key(document)} and numberOfAccessors < {5000 (or whatever your max is)} limit 1
Append {userid} to accessors and put the entity
O(n) where n is the number of people who have access to the document.
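Sketched in the same style (the 5000 cap and the next_index_num helper for creating a fresh shard are assumptions):

def add_access(userid, doc_key, max_size=5000):
    # check if the user already has access
    already = db.GqlQuery(
        "SELECT __key__ FROM ACL WHERE accessors = :1 AND ANCESTOR IS :2",
        userid, doc_key).get()
    if already:
        return
    # find a shard with room, or start a new one
    acl = db.GqlQuery(
        "SELECT * FROM ACL WHERE numberOfAccessors < :1 AND ANCESTOR IS :2",
        max_size, doc_key).get()
    if acl is None:
        # next_index_num is a hypothetical sharded-counter helper
        acl = ACL(key_name='%s||%d' % (doc_key.name(), next_index_num(doc_key)),
                  parent=doc_key, accessors=[], numberOfAccessors=0)
    acl.accessors.append(userid)
    acl.numberOfAccessors = len(acl.accessors)
    acl.put()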
Remove access for a user to a particular document
select * from ACL where accessors = {userid} and parent = {key(document)}
Remove {userid} from accessors and put the entity
O(n) where n is the number of people who have access to the document.
Compact the indexes
You'll have to do this once in a while if you do a lot of removals. Not sure of the best way to detect this.
To find out whether there's anything to compact for a particular document: select * from ACL where parent = {key(document)} and numberOfAccessors < {2500 (or half whatever your max is)}
For each/any pair of these: delete one, appending the accessors to the other
O(n) where n is the number of people who have access to the document
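A rough sketch of that compaction pass (the half-the-max threshold is from the answer; the pairwise merge loop is mine):

def compact(doc_key, max_size=5000):
    small = db.GqlQuery(
        "SELECT * FROM ACL WHERE numberOfAccessors < :1 AND ANCESTOR IS :2",
        max_size // 2, doc_key).fetch(100)
    # fold one half-empty shard into another, then delete it; since both
    # are under max_size // 2, the merged shard never exceeds max_size
    while len(small) >= 2:
        a, b = small.pop(), small.pop()
        a.accessors.extend(b.accessors)
        a.numberOfAccessors = len(a.accessors)
        a.put()
        db.delete(b)
        if a.numberOfAccessors < max_size // 2:
            small.append(a)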

Related

Efficiently modelling a Feed schema on Google Cloud Datastore?

I'm using GCP/App Engine to build a Feed that returns posts for a given user in descending order of the post's score (a modified timestamp). Posts that are not 'seen' are returned first, followed by posts where 'seen' = true.
When a user creates a post, a Feed entity is created for each one of their followers (i.e. a fan-out inbox model)
Will my current index model result in an exploding index and/or contention on the 'score' index if many users load their feed simultaneously?
index.yaml
indexes:
- kind: "Feed"
  properties:
  - name: "seen"   # Boolean
  - name: "uid"    # The user this feed belongs to
  - name: "score"  # Int timestamp
    direction: desc
# Other entity fields include: authorUid, postId, postType
A user's feed is fetched by:
SELECT postId FROM Feed WHERE uid = abc123 AND seen = false ORDER BY score DESC
Would I be better off prefixing the 'score' with the user id? Would this improve the performance of the score index? e.g. score="{alphanumeric user id}-{unix timestamp}"
From the docs:
You can improve performance with "sharded queries", that prepend a fixed length string to the expiration timestamp. The index is sorted on the full string, so that entities at the same timestamp will be located throughout the key range of the index. You run multiple queries in parallel to fetch results from each shard.
With just 4 entities I'm seeing 44 indexes which seems excessive.
You do not have an exploding indexes problem; that problem is specific to queries on entities with repeated properties (i.e. properties with multiple values) when those properties are used in composite indexes. From Index limits:
The situation becomes worse in the case of entities with multiple properties, each of which can take on multiple values. To accommodate such an entity, the index must include an entry for every possible combination of property values. Custom indexes that refer to multiple properties, each with multiple values, can "explode" combinatorially, requiring large numbers of entries for an entity with only a relatively small number of possible property values. Such exploding indexes can dramatically increase the storage size of an entity in Cloud Datastore, because of the large number of index entries that must be stored. Exploding indexes also can easily cause the entity to exceed the index entry count or size limit.
The 44 built-in indexes are nothing more than the indexes created for the multiple indexed properties of your 4 entities (probably your entity model has about 11 indexed properties). Which is normal. You can reduce the number by scrubbing your model usage and marking as unindexed all properties which you do not plan to use in queries.
You do, however, have the problem of a potentially high number of index updates in a short time: when a user with many followers creates a post, all of those index updates fall in a narrow range of the index - the hotspotting that the article you referenced addresses. Prepending the score with the follower's user ID (not the post creator's ID, which won't help, since the same number of updates hits the same index range for a single posting event regardless of whether sharding is used) should help. The impact of followers reading the post (when the score property is updated) is less of a concern, since it's unlikely that all followers read the post at exactly the same time.
Unfortunately prepending the follower ID doesn't help with the query you intend to do as the result order will be sorted by follower ID first, not by timestamp.
What I'd do:
combine the functionality of the seen and score properties into one: a score value of 0 can be used to indicate that a post was not yet seen, any other value would indicate the timestamp when it was seen. Fewer indexes, fewer index updates, less storage space.
I wouldn't bother with sharding in this particular case:
reading a post takes a bit of time; one follower reading multiple posts won't typically happen fast enough for the index updates for that particular follower to be a serious problem. In the rare worst case an already-read post may appear as unread - IMHO not bad enough to justify sharding
delays in updating the indexes for all followers are, again, IMHO not a big problem - it may just take a bit longer for a post to appear in a follower's feed
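One possible reading of the combined seen/score suggestion, sketched with the Python db API (field types and fetch sizes are assumptions):

from google.appengine.ext import db

class Feed(db.Model):
    uid = db.StringProperty()      # the user this feed entry belongs to
    postId = db.StringProperty()
    # 0 = not yet seen; otherwise the timestamp at which the post was seen
    score = db.IntegerProperty(default=0)

uid = 'abc123'  # the requesting user
unseen = Feed.gql("WHERE uid = :1 AND score = 0", uid).fetch(50)
seen = Feed.gql("WHERE uid = :1 AND score > 0 ORDER BY score DESC", uid).fetch(50)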

GAE datastore index vs normalisation

Given the below entity in the Google App Engine datastore, is it better to define an index on reportingIds or to define a separate entity which has only personId and reportingIds fields? Based on the documentation, I understand that defining an index increases the count of operations against the datastore quota.
Below are the entities in GAE Go. My code needs to scan through Person entities frequently, and it needs to limit its scan to Person entities that have at least 1 reporting person. I see 2 approaches: define an index on reportingIds and query by specifying filters, or create/update a PersonWithReporters entity whenever a Person gets a new reporting person. In the second case, my code needs to iterate through all the entities in PersonWithReporters and need not construct any index/query. I can iterate using Key, which is always guaranteed to have the latest data. Not sure which approach is beneficial considering datastore operation counts against the quota limit.
type Person struct {
    Id string // unique person id
    // many other personal details, his personal settings etc.
    reportingIds []string // ids of the Persons this guy manages
}
type PersonWithReporters struct {
    Id string // Person managing reportees
    reportingIds []string // ids of the Persons this guy manages
}
An approach with a separate entity gives you two advantages.
As you have already mentioned, you don't need to index/query all Person entities.
Every time a Person gets a new reporting person, you will create a new entity, which may be significantly cheaper than updating a Person entity which has many other properties, some of which, presumably, are indexed.
Your approach with a separate entity is also not ideal. When you index a property with multiple values, under the hood the Datastore creates an index entry for each value. So, when you add reporting person number 3 to this entity, you have to update 3 index entries instead of 1.
You can optimize your data model even further by creating a Reporter entity with no properties! Every time a new reporting person is added, you create this Reporter entity with ID set to the ID of a reporting person, and make it a child entity of a Person entity representing a person to whom this reporter reports.
Now, when you need to iterate through all persons with someone reporting to them, you run a simple query on this Reporter entity - no filters. This query can be set to keys-only (there is nothing other than a key in this entity anyway, but keys-only queries are treated differently - they are basically free).
For every entity returned by this query you retrieve its key, and this key contains an ID (which is an ID of a reporting person), and a parent key, which includes an ID of a person who this reporter reports to.
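The question is in Go, but the pattern is small enough to sketch in Python's db API (kind and id names are illustrative, and string ids are assumed):

from google.appengine.ext import db

class Reporter(db.Model):
    pass  # no properties: both ids live entirely in the key

def add_reporter(manager_key, reporter_id):
    # child of the managing Person; the key_name is the reporting person's id
    Reporter(key_name=reporter_id, parent=manager_key).put()

# every (manager, reporter) pair via a cheap keys-only query:
for key in db.GqlQuery("SELECT __key__ FROM Reporter"):
    manager_id = key.parent().name()
    reporter_id = key.name()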
Unless AppEngine's datastore in Go is very different from how it works in Java or Python, you cannot index an array natively - so option 1 is out of the question, and so is option 2.
I suggest option three, which is to define a
type PersonWithReporters struct {
    Id string // concatenate(managing_Person_id, separator, reporter_Person_id) to avoid id collisions
    reportingId string // indexed
    managingId string // probably indexed as well
}
You would create multiple of these entities instead of a single entity with an array. You also add an index on reportingId. Now you can create a filter query on this entity and should be able to retrieve the desired information.
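For example, the lookups become plain equality filters (GQL shown for brevity; in Go this would be datastore.NewQuery("PersonWithReporters").Filter(...)):

from google.appengine.ext import db

person_id = 'p123'  # illustrative
# all reporters managed by a given person
reporters = db.GqlQuery(
    "SELECT * FROM PersonWithReporters WHERE managingId = :1", person_id)
# everyone a given person reports to
managers = db.GqlQuery(
    "SELECT * FROM PersonWithReporters WHERE reportingId = :1", person_id)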
I would worry more about performance and not too much about the quota limits, they are pretty high. Just implement it, see how it works and whether quota is your main concern here.

Google App Engine - What's the recommended way to keep entity amount within limit?

I have some entities of a kind, and I need to keep them within a limited amount by discarding old ones, just like log entry maintenance. Any good approach on GAE to do this?
Options in my mind:
Option 1. Add a Date property to each of these entities. Create a cron job to check datastore statistics daily. If the size exceeds the limit, query some entities of that kind sorted by date, oldest first, and delete them until the size is less than, for example, 0.9 * max_limit.
Option 2. Option 1 requires an additional property with an index. I observed that the entity key ids seem likely to be increasing. So I'd like to query only keys, sort in ascending order, and delete the ones with smaller ids. This requires no additional property (date) and no index. But I'm seriously worried: are key ids actually guaranteed to increase?
I think this is a common data maintenance task. Is there any mature way to do it?
By the way, a tiny ad for my app, free and purely for coder's fun! http://robotypo.appspot.com
You cannot assume that the IDs are always increasing. The docs about ID generation only guarantee that:
IDs allocated in this manner will not be used by the Datastore's automatic ID sequence generator and can be used in entity keys without conflict.
The default sort order is also not guaranteed to be sorted by ID number:
If no sort orders are specified, the results are returned in the order they are retrieved from the Datastore.
which is vague and doesn't say that the default order is by ID.
One solution may be to use a rotating counter that keeps track of the first element. When you want to add new entities: fetch the counter, increment it, mod it by the limit, and add a new element with an ID as the value of the counter. This must all be done in a transaction to guarantee that the counter isn't being incremented by another request. The new element with the same key will overwrite one that was there, if any.
When you want to fetch them all, you can manually generate the keys (since they are all known), do a bulk fetch, sort by ID, then split them into two parts at the value of the counter and swap those.
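A minimal sketch of that rotating counter in Python's db API (kind names and the LIMIT constant are illustrative; a cross-group transaction could cover the slot write too):

from google.appengine.ext import db

LIMIT = 1000  # assumed cap on the number of entities to keep

class RingCounter(db.Model):
    value = db.IntegerProperty(default=0)

class Slot(db.Model):  # hypothetical kind holding one log entry
    payload = db.TextProperty()

def append(payload):
    def txn():
        c = RingCounter.get_by_key_name('ring') or RingCounter(key_name='ring')
        c.value = (c.value + 1) % LIMIT
        c.put()
        return c.value
    slot = db.run_in_transaction(txn)
    # reusing the same key_name overwrites the oldest entry in this slot
    Slot(key_name='slot-%d' % slot, payload=payload).put()

def fetch_all():
    keys = [db.Key.from_path('Slot', 'slot-%d' % i) for i in range(LIMIT)]
    # db.get returns None for slots not yet written; reorder around the
    # counter value afterwards to recover oldest-first ordering
    return [e for e in db.get(keys) if e]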
If you want the IDs to be unique, you could maintain a separate counter (using transactions to modify and read it so you're safe) and create entities with IDs as its value, then delete IDs when the limit is reached.
You can create a second entity (let's call it A) that keeps a list of the keys of the entities you want to limit, like this (pseudo-code):
class A:
    List<Key> limitedEntities;
When you add a new entity, you add its key to the list in A. If the length of the list exceeds the limit, you take the first element of the list and remove the corresponding entity.
Notice that when you add or delete an entity, you should modify the list of entity A in a transaction. Since these entities belong to different entity groups, you should consider using Cross-Group Transactions.
Hope this helps!

MongoDB - Search _id in an array or not

I want to implement a user follow system. A user can follow other users. I'm considering two approaches. One is that there are followers and followees in the User schema, both of them arrays of user _ids. The other is that there's only followers in the schema. Whenever I want to find all the users a given user follows, I have to search every user's followers array, that is, db.user.find( { followers: "_id" } );. What are the pros and cons of the two approaches? Thanks.
What you're considering is a classic "many-to-many" relationship here. Unlike a RDBMS, where there is a single "correct" normal form for this schema, in MongoDB the correct schema design depends on the way you'll be using your data, as well as a couple of other factors you haven't mentioned here.
Note that for this discussion I'm assuming that the "follows" relationship is NOT symmetric -- that is, that A can follow B without B having to follow A.
1) There are two basic ways to model this relationship in MongoDB.
You can have an indexed "following" array embedded in the user document.
You can have a separate collection of "following" documents, like this:
{ user: ObjectID("x"), following: ObjectID("y") }
You'd have one document in this collection for each following relationship. You'd need to have two indexes on this collection, one for "user" and one for "following".
Note that the second suggestion in your question (having arrays of both "following" and "followed" in the user document) is simply a variation of the first.
2) The correct design depends on a few factors that you haven't mentioned here.
How many followers can one person have, and how many people can one person follow?
What is your most common query? Is it to present a list of followers, or to present a list of users that are being followed?
How often will you be updating the followers/following list(s)?
3) The trade-offs are as follows:
The advantages to the embedded array approach are that the code is simpler, and you can fetch the entire array of followed users in a single document. If you index the 'following' array, then the query to find all of a user's followers will be relatively quick, as long as that index fits entirely in RAM. (This is no different than a relational database.)
The disadvantages to the embedded array approach occur if you are frequently updating the followers, or if you allow an unlimited number of followers / following.
If you allow an unlimited number of followers/following, then you can potentially overflow the maximum size of a MongoDB document. It's not unheard-of for some people to have 100K followers or more. If this is the case, then you'll need to go to the separate collection approach.
If you know that there will be frequent updates to the followers, then you'll probably want to use the separate collection approach as well. The reason is that every time you add a follower, you grow the size of the 'followers' array. When it reaches a certain size, it will outgrow the amount of space reserved for it on disk, and MongoDB will have to move the document. This will incur additional write overhead, as all of the indexes for that document will have to be updated as well.
4) If you want to use the embedded array approach, there are a couple of things that you can do to make it more feasible.
First, when you create a new user, you can create the document with a large number of dummy entries pre-created. (E.g., you populate the 'following' array with a large number of entries that you know don't refer to any actual user - perhaps ID 0.) That way, when you add a new entry, you replace one of the ID 0 entries with a real one, and the document size doesn't grow.
Second, you can limit the number of followers that someone can have, and check for that in the application.
Note that if you use the two-array approach in your document, you will cut the maximum number of followers that one person can have (since a portion of the document will be taken up with the array of users that they are following).
5) As an optimization, you can change the 'following' documents to be bucketed. So, instead of one document for each following relationship, you might bucket them by user:
{ user: "X", following: [ "A", "B", "C" ... ] }
{ user: "X", following: [ "H", "I", "J" ... ] }
{ user: "Y", following: [ "A", "X", "K" ... ] }
6) For more about the ways to model many-to-many, see this presentation:
http://www.10gen.com/presentations/mongosf2011/schemabasics
For more information about the "bucketing" design pattern, see this entry in the MongoDB docs:
http://docs.mongodb.org/manual/use-cases/storing-comments/#hybrid-schema-design
If you provide both followers and followees then you can probably service most of your queries efficiently without a secondary index on either of those fields. For example, you can retrieve the current user and then use the default index on _id to retrieve lists of all of their connections.
db.users.find({_id: {$in: user_A.followers}})
If you don't include followees, you need to create a secondary index on followers in order to service some queries without a collection scan. For example, to determine all of the followees of user A, you would use a query as follows:
db.users.find({followers: user_A._id})
The secondary index costs you some memory and disk space but avoids potential data inconsistencies (mismatched follower and followee lists).

App Engine Datastore: entity design and query optimization

I have a system where users can vote on entities, whether they like or hate them. It will be bazillions of votes and trazillions of records, hopefully, some time in the future :)
At the moment I store a vote in an Entity like this:
UserRecordVote: recordId, userId, hateOrLike
And when I want to get every Record the user liked, I do a query like this:
I query the "UserRecordVote" table for all the "likes", then I take the recordIds from that result set, create keys from that property and get the records from the Record table.
Then I aggregate all that in a list and return it.
Here's the question:
I came up with a different approach and I want to find out 1. whether it is faster and 2. how much the difference in cost is.
I would create an Entity whose kind name would be userId + "likes" and whose key would be the record id:
new Entity(userId + "likes", recordId)
So when I do a query to get all the likes, I can simply query for all - no filters needed. AND I can just grab the entity keys, which would be much cheaper if I remember the App Engine documentation right (can't find the pricing page anymore). Then I can take the Iterable of keys and do a single get(Iterable keys). OK, so I guess this approach is faster and cheaper, right? But what if I want to grab all the votes of a user or, better said, all the records a user didn't vote on yet?
Here's the real question:
I want to load all the records a user didn't vote on yet:
So I would have entities like this:
new Entity(userId+"likes", recordId);
and
new Entity(userId+"hates", recordId);
I would query both vote tables for all entity keys and query the record table for all entity keys. Then I would remove all the record entity keys matching one of the vote entity keys, and with the result I would get(Iterable keys) the full entities and have all the record entities which are not in one of the two voting tables.
Is that a useful approach? Is that the fastest and most cost-efficient way to do a datastore query? Am I totally wrong and should I store the information as list properties instead?
EDIT:
With that approach I would have 2 entity groups for each user, which would result in millions of different entity groups. How would the GAE Datastore handle that? At least the Datastore Viewer entity select box would probably crash :) ?
To answer the Real Question, you probably want to have your hateOrLike field store an integer that indicates either hated/liked/notvoted. Then you can filter on hateOrLike=notVoted.
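Sketched in Python's db API (the integer encoding and the pre-created per-(user, record) rows are assumptions):

from google.appengine.ext import db

LIKE, HATE, NOT_VOTED = 1, -1, 0  # assumed encoding

class UserRecordVote(db.Model):
    recordId = db.StringProperty()
    userId = db.StringProperty()
    hateOrLike = db.IntegerProperty(default=NOT_VOTED)

user_id = 'u42'  # illustrative
# records this user hasn't voted on yet (one row per (user, record) pair):
pending = UserRecordVote.gql(
    "WHERE userId = :1 AND hateOrLike = :2", user_id, NOT_VOTED)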
The other solutions you propose with the dynamically named entities make it impossible to query on other aspects of your entities, since you don't know their names.
The other thing is, you expect this to be huge; you likely want to keep a running counter of your votes rather than tabulating every time you pull up a UserRecord - querying all the votes and then calculating them on each view is very slow, especially since App Engine will only return 1000 results per query, and if you have more than 1000 votes you'll have to keep making repeated queries to get all the results.
If you think people will vote quickly, you should look into using a sharded counter for performance. There are examples of that with code available if you do a google search.
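For reference, a minimal sharded counter along the usual GAE lines (shard count and kind name are illustrative):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class VoteCounterShard(db.Model):
    name = db.StringProperty()             # e.g. the record id being counted
    count = db.IntegerProperty(default=0)

def increment(name):
    # picking a random shard spreads contention across NUM_SHARDS entity groups
    def txn():
        key_name = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
        shard = VoteCounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = VoteCounterShard(key_name=key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(name):
    return sum(s.count for s in VoteCounterShard.all().filter('name =', name))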
Consider serializing user hate/like votes in two separate TextProperties inside the entity. Use the userId as key_name.
rec = UserRecordVote.get_by_key_name(userId)
# count of '_'-separated hate votes; guard against the empty string
hates = len(rec.hates.split('_')) if rec.hates else 0
etc.
