Query with multiple ancestors in Google App Engine (GAE) Java - google-app-engine

I have two entities, 'user' and 'comment'. When I create a comment, I set a user as its ancestor path. I added 5 users and some comments for each user.
What I want to do is filter out the comments for just two of those users within one query, with a limit and an offset.
I have searched the Google App Engine documentation, but I couldn't find an answer for this.

The reason for not finding anything is that this is not possible by design.
An ancestor query always filters by a single ancestor key. If you need to find all children of multiple ancestors, you have to store the parent id as an indexed property on your child entity and filter by that property instead.
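A minimal sketch of that workaround, using plain Java collections to stand in for the datastore: each comment carries its parent user's id as an ordinary property, and the query becomes a filter on that property with a limit and offset. The class and field names here are illustrative, not part of any GAE API.

```java
import java.util.*;
import java.util.stream.*;

public class CommentFilter {
    static class Comment {
        final String parentUserId; // stored as an indexed property, not as the ancestor
        final String text;
        Comment(String parentUserId, String text) {
            this.parentUserId = parentUserId;
            this.text = text;
        }
    }

    // Equivalent in spirit to:
    //   SELECT * FROM Comment WHERE parentUserId IN (ids) LIMIT limit OFFSET offset
    static List<Comment> forUsers(List<Comment> all, Set<String> ids,
                                  int offset, int limit) {
        return all.stream()
                  .filter(c -> ids.contains(c.parentUserId))
                  .skip(offset)
                  .limit(limit)
                  .collect(Collectors.toList());
    }
}
```

In the real datastore this maps to an IN filter (or two single-user queries merged client-side) on the indexed `parentUserId` property.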

Related

How do I set ancestors in GAE datastore viewer?

My use case involves 2 kinds, Customers and Orders.
From the docs I read that entities can have descendants; the example uses persons as the kind. In my case I want a customer to have a bunch of orders underneath it. I wanted to try it out in the console before diving in, but I can't seem to set the customer as the parent key of the order. Any help?
This picture shows the Customer I have made. Note the id.
Here is the Order that I want as a descendant of the customer.
As you can see here I tried to put the customerID as a key, but the Ancestor path still points to the order itself.
Is this just a limitation of the console?
Also, if I try it in code, how can I refer to this specific datastore and namespace? I'm going to be doing this in java.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
This looks like it's just making a new datastore.
You can't set the ancestor of an existing entity. The ancestor is part of the ID, and must be set on creation; you can't change the ID once it's created.
There is only one datastore. That code just creates an instance of the client.
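To illustrate the point about ancestors being part of the ID: a key path embeds the parent at construction time and is immutable afterwards. The class below models this in plain Java; in the actual GAE Java API the equivalent is passing the parent key when constructing the entity, e.g. `new Entity("Order", customerKey)`.

```java
// Illustrative model of a datastore key path. The parent is part of the
// child's identity, fixed at construction, and cannot be changed later.
public final class KeyPath {
    private final KeyPath parent; // null for a root entity
    private final String kind;
    private final long id;

    public KeyPath(KeyPath parent, String kind, long id) {
        this.parent = parent;
        this.kind = kind;
        this.id = id;
    }

    // Full path, e.g. "Customer(1)/Order(42)"
    public String path() {
        String self = kind + "(" + id + ")";
        return parent == null ? self : parent.path() + "/" + self;
    }
}
```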

How to create a composite index on two fields in Google Objectify that is not written on create and update requests?

I want to create a composite index in an entity on its ID and creation date.
There is one condition: I don't want these indexes to be written when the object is created or updated.
I am using Google Objectify, and I will use this composite index in my search query.
Please help?
Objectify has a feature called partial indexes, which lets you define conditions that a property has to meet in order to get indexed.
You could use that so that those fields are only indexed if a given attribute (e.g. lastOperation) is not "create" or "update".
Bear in mind that tampering with index updates might lead to invalid query results, since the index records (used for search) won't match the actual entity values.
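A sketch of the condition logic, in plain Java. In Objectify the same idea is expressed as a condition class passed to the `@Index(...)` annotation; here only the predicate itself is shown. The `lastOperation` field name and its values are assumptions taken from the question, not a real API.

```java
// Partial-index condition: index the field only when the entity's last
// operation was not a create or an update.
public class PartialIndexCondition {
    static boolean shouldIndex(String lastOperation) {
        return !"create".equals(lastOperation)
            && !"update".equals(lastOperation);
    }
}
```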

Which database model to store data in?

I am writing an application in Google App Engine with python and I want to sort users and user posts into groups. Users will be able to tag a post with a group ID and then that post will be displayed on the group page.
I would also like to relate the users to the groups so that only members of a group can tag a post with that group ID and so that I can display all the users of a group on the side. I am wondering if it would be more efficient to have a property on the user which will have all of the groups listed (I am thinking max 10 or so) or would it be better to have a property on the Group model which lists all of the users (possibly a few hundred).
Is there much of a difference here?
Your data model should derive from the most likely use cases. What are you going to retrieve?
A. Show a list of groups to a user.
B. Show a list of users in a group.
Solution:
If only A, store unindexed list of groups in a property of a user entity.
If both, same as above but indexed.
If only B, store unindexed list of users in a property of a group entity.
NB: If you make a property indexed, you cannot put hundreds of user ids in it - it will lead to an exploding index.
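A back-of-the-envelope sketch of why that warning matters: when an indexed list property participates in a composite index with another list property, the datastore writes one index entry per combination, so the cost multiplies. The per-entity cap used below is an assumption for illustration.

```java
// Rough index-entry arithmetic for a composite index over two list
// properties: entries = |listA| * |listB| ("exploding index").
public class IndexCost {
    static long compositeEntries(int listA, int listB) {
        return (long) listA * listB;
    }

    static boolean explodes(int listA, int listB, int maxEntries) {
        return compositeEntries(listA, listB) > maxEntries;
    }
}
```

A group with a few hundred indexed user ids blows past the cap as soon as it is combined with even a short second list property.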

Efficient group membership test for ACLs on AppEngine

I'm creating an access control list for objects in my datastore. Each ACL entry could have a list of all user ids allowed to access the corresponding entry. Then my query to get the list of entities a user can access would be pretty simple:
select * from ACL where accessors = {userId} and searchTerms >= {search}
The problem is that this can only support 2500 users before it hits the index entry limit, and of course it would be very expensive to put an ACL entry with a lot of users because many index entries would need to be changed.
So I thought about adding a list of GROUPs of users that are allowed to access an entity. That could drastically lower the number of index entries needed for each ACL entry, but querying gets longer because I have to query for every possible group that a user is in:
select * from ACL where accessors = {userId} and searchTerms >= {search}
for (GroupId id : theSetOfGroupsTheUserBelongsTo) {
select * from ACL where accessingGroups = {id} and searchTerms >= {search}
}
mergeAllTheseResultsTogether()
which would take a long time, be much more difficult to page through, etc.
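The merge step from the pseudocode above could be sketched like this: concatenate the per-group result sets, de-duplicate, and keep a stable order so that offset/limit paging over the merged stream stays consistent. The keys here are placeholder strings; a real implementation would merge entity keys.

```java
import java.util.*;

public class ResultMerge {
    static List<String> merge(List<List<String>> perGroupResults) {
        // TreeSet both de-duplicates entities that match several groups
        // and keeps a stable sort order for paging.
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> results : perGroupResults) {
            merged.addAll(results);
        }
        return new ArrayList<>(merged);
    }
}
```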
Can anyone recommend a way to fetch a list of entities from an ACL that doesn't limit the number of accessing users?
Edit for more detail:
I'm searching and sorting on a long set of academic topics in use at a school. Some of the topics are created by administrators and should be school-wide. Others are created by teachers and are probably only relevant to those teachers. I want to create a google-docs-list-like hierarchy of collections that treats each topic like a document. The searchTerms field would be a list of words in the topic name - there is not a lot of internal text to search. Each topic will be in at least one collection (the organization's "root" collection) and could be in as many as 10-20 other collections, all managed by different people. Ideally there'd be no upper limit to the number of collections a document might appear in. My struggle here is to produce a list of all of the entities a particular user has at least read access to - the analog in google docs would be the "All Items" view.
Assuming that your documents and group permissions change less often (or are less time critical) than user queries, I suggest this (which is how I'm solving a similar problem):
In your ACL, include the fields
accessors <-- all userids that can access the document
numberOfAccessors <-- store the length of accessors whenever you change that field
searchTerms
The key_name for ACL would be something like "indexed_document_id||index_num"
index_num in the key lets you have multiple entities storing the list of users, in case there are more than 5000 (the datastore limit on items in a list) or however many you want in a list to reduce the cost of loading one (though you won't need to do that often).
Don't forget that the document to be accessed should be the parent of the index entity. That way you can do a select __key__ query rather than a select * (this avoids having to deserialize the accessors and searchTerms fields). You can take the parent() of each returned key without needing to access any of the fields. More on that and other GAE search design at this blog post. Sadly that blog post doesn't cover ACL indexes like ours.
Disclaimer: I've now encountered a problem with this design in that what document a user has access to is controlled by whether they are following that user. That means that if they follow or unfollow, there could be a large number of existing documents the user needs to be added/removed from. If this is the case for you, then you might be stuck in the same hole as me if you follow my technique. I currently plan to handle this by updating the indexes for old documents in the background, over time. Someone else answering this question might have a solution to it baked in - if not I may post it as a separate question.
Analysis of operations on this datastructure:
Add an indexed document:
For each group that has access to the document, create an entity which includes all users that can access it in the accessors field
If there are too many to fit in one field, make more entities and increment that index_num value (using sharded counters).
O(n*m), where n is the number of users and m is the number of search terms
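The sharding step above can be sketched as follows: split the accessors list into chunks that fit under the per-entity cap, and derive the `"indexed_document_id||index_num"` key_name described earlier. The cap value passed in is an assumption.

```java
import java.util.*;

public class AccessorShards {
    // Split the accessors into shard entities of at most maxPerEntity ids.
    static List<List<String>> shard(List<String> accessors, int maxPerEntity) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < accessors.size(); i += maxPerEntity) {
            shards.add(new ArrayList<>(
                accessors.subList(i, Math.min(i + maxPerEntity, accessors.size()))));
        }
        return shards;
    }

    // key_name scheme from the answer: "indexed_document_id||index_num"
    static String keyName(String documentId, int indexNum) {
        return documentId + "||" + indexNum;
    }
}
```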
Query an indexed document:
select __key__ from ACL where accessors = {userid} and searchTerms >= {search} (though I'm not sure why you use ">=" here; in my queries it's always "=")
Get all the parent keys from these keys
Filter out duplicates
Get those parent documents
O(n+m), where n is the number of users and m is the number of search terms. This is pretty fast: it uses the zig-zag merge join of two indexes (one on accessors, one on searchTerms). This assumes that GAE index scans are linear; they might be logarithmic for "=" queries, but I'm not privy to the design of their indexes, nor have I done any tests to verify. Note also that you don't need to load any of the properties of the index entity.
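The zig-zag merge join mentioned above can be sketched as intersecting two sorted index scans by repeatedly advancing whichever side is behind. This shows the idea only; it is not the datastore's actual implementation.

```java
import java.util.*;

public class ZigZag {
    // Intersect two sorted lists of keys, advancing the lagging side.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++; // a is behind: zig
            else j++;              // b is behind: zag
        }
        return out;
    }
}
```

Each scan position only ever moves forward, which is why the cost is linear in the lengths of the two scans.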
Add access for a user to a particular document
Check if the user already has access: select __key__ from ACL where accessor = {userid} and parent = {key(document)}
If not, add it: select * from ACL where parent = {key(document)} and numberOfAccessors < {5000 (or whatever your max is)} limit 1
Append {userid} to accessors and put the entity
O(n) where n is the number of people who have access to the document.
Remove access for a user to a particular document
select * from ACL where accessor = {userid} and parent = {key(document)}
Remove {userid} from accessors and put the entity
O(n) where n is the number of people who have access to the document.
Compact the indexes
You'll have to do this once in a while if you do a lot of removals. Not sure of the best way to detect this.
To find out whether there's anything to compact for a particular document: select * from ACL where parent = {key(document)} and numberOfAccessors < {2500 (or half whatever your max is)}
For each/any pair of these: delete one, appending the accessors to the other
O(n) where n is the number of people who have access to the document

How to design a database schema for a search engine?

I'm writing a small search engine in C with curl, libxml2, and mysql. The basic plan is to grab pages with curl, parse them with libxml2, then iterate over the DOM and find all the links. Then traverse each of those, and repeat, all while updating a SQL database that maintains the relationship between URLs.
My question is: how can I best represent the relationship between URLs?.
Why not have a table of base urls (i.e. www.google.com/) and a table of connections, with these example columns:
starting page id (from url table)
ending page id (from url table)
the trailing directory of the urls as strings in two more columns
This will allow you to join on certain urls and pick out information you want.
Your solution seems like it would be better suited to a non relational datastore, such as a column store.
Most search engine indices aren't stored in relational databases, but in memory, so as to minimize retrieval time.
Add two fields to table - 'id' and 'parent_id'.
id - unique identifier for URL
parent_id - link between URL's
If you want to have a single entry for each URL then you should create another table that maps the relationships.
You then lookup the URL table to see if it exists. If not create it.
The relationship table would have
SourceUrlId,
UrlId
Where SourceUrlId is the page and UrlId is the url it points to. That way you can have multiple relationships for the same URL, and you won't need a new entry in the Url table for every link to that URL. It also means only one copy of any other info you are storing.
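The two-table design above can be sketched with plain collections standing in for the SQL tables: a Url table keyed by id, and a relationship table of (SourceUrlId, UrlId) pairs. Names mirror the columns described; this is an in-memory illustration, not the crawler's actual storage layer.

```java
import java.util.*;

public class LinkGraph {
    // "Url table": URL string -> id, created on first sight only.
    private final Map<String, Long> urlToId = new HashMap<>();
    private long nextId = 1;
    // "Relationship table": SourceUrlId -> set of UrlIds it links to.
    private final Map<Long, Set<Long>> links = new HashMap<>();

    // Look up the Url table; insert only if the URL doesn't exist yet.
    long idFor(String url) {
        return urlToId.computeIfAbsent(url, u -> nextId++);
    }

    void addLink(String sourceUrl, String targetUrl) {
        links.computeIfAbsent(idFor(sourceUrl), k -> new TreeSet<>())
             .add(idFor(targetUrl));
    }

    Set<Long> outLinks(String sourceUrl) {
        return links.getOrDefault(idFor(sourceUrl), Collections.emptySet());
    }
}
```

Because the Url table is keyed on the URL itself, many pages linking to the same target share a single Url row, as the answer describes.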
Why are you interested in representing pages graph?
If you want to compute the ranking, then it's better to have a more succinct and efficient representation (e.g., matricial form if you want to compute something similar to PageRank).
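As a sketch of the matricial ranking computation alluded to above: a few power-iteration steps of PageRank over a tiny link matrix. The damping factor of 0.85, the 3-page graph in the test, and the simplified handling of dangling pages are all illustrative choices, not a production implementation.

```java
public class PageRankSketch {
    // linksTo[i][j] is true when page i links to page j.
    static double[] rank(boolean[][] linksTo, int iterations) {
        int n = linksTo.length;
        double d = 0.85; // damping factor
        double[] r = new double[n];
        java.util.Arrays.fill(r, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int i = 0; i < n; i++) {
                int outDeg = 0;
                for (int j = 0; j < n; j++) if (linksTo[i][j]) outDeg++;
                if (outDeg == 0) continue; // dangling page, simplified away
                for (int j = 0; j < n; j++)
                    if (linksTo[i][j]) next[j] += d * r[i] / outDeg;
            }
            r = next;
        }
        return r;
    }
}
```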
