Projection queries with zigzag merge - google-app-engine

I would like to use projection queries on AppEngine together with zigzag merge. It appears that this requires the projected property to be included in every index used by the zigzag merge query. In my use case this would result in entity update costs which are too high.
To illustrate, below is a simple example using the Java low-level Datastore API and using the indices Index(E, p1, p3) and Index(E, p2, p3); this works but duplicates the p3 property of entity E in the two indices.
// Create a sample entity with three (indexed) properties.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Entity e = new Entity("E");
e.setProperty("p1", 1);
e.setProperty("p2", 1);
e.setProperty("p3", 1);
datastore.put(e);
// Query for the above entity with a projection on property p3.
Query q = new Query("E");
Filter filter1 = new FilterPredicate("p1", FilterOperator.EQUAL, 1);
Filter filter2 = new FilterPredicate("p2", FilterOperator.EQUAL, 1);
q.setFilter(CompositeFilterOperator.and(filter1, filter2));
q.addProjection(new PropertyProjection("p3", Integer.class));
PreparedQuery pq = datastore.prepare(q);
pq.asList(FetchOptions.Builder.withDefaults());
I'd like to remove one of the composite indices, say Index(E, p2, p3), and just rely on the default index for property p2, thus reducing update costs. But doing so results in a DatastoreNeedIndexException at runtime.
Note that a similar problem occurs if I keep the above two indices but add a fourth property to only one of them and include this fourth property in the projection. The use of a default index therefore does not seem to be the problem.
So my question: is there any way of doing projection queries with zigzag merge without duplicating all the projected properties across indices? If not, I'd like to understand what the underlying technical reason is.
Any pointers greatly appreciated.

OK, so I now see why the projected property has to be duplicated in every index involved: zigzag merge only works if all the relevant index blocks (two in this example) are sorted in the same order.
In the example, the last sort key of each index is the projected property p3. Removing p3 from one index changes that index's sort order, so the two index scans are no longer ordered the same way and cannot be merged.
So I don't think what I'm after is possible on AppEngine at the moment. It would take a new, dedicated AppEngine feature: indexed properties that do not affect the index sort order.
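To make the sort-order argument concrete, here is a minimal sketch in plain Python (not the Datastore API). It assumes each index scan is a list of hypothetical (p3, key) rows for one equality prefix, and the function name zigzag_merge is invented for illustration. It is a simplified model of the idea, not how the Datastore is actually implemented.

```python
from bisect import bisect_left

def zigzag_merge(scan_a, scan_b):
    """Intersect two index scans that are sorted in the same order."""
    results = []
    i = j = 0
    while i < len(scan_a) and j < len(scan_b):
        if scan_a[i] == scan_b[j]:
            results.append(scan_a[i])
            i += 1
            j += 1
        elif scan_a[i] < scan_b[j]:
            # Jump scan_a forward to the first row >= scan_b's current row.
            i = bisect_left(scan_a, scan_b[j], i)
        else:
            # Jump scan_b forward to the first row >= scan_a's current row.
            j = bisect_left(scan_b, scan_a[i], j)
    return results

# Hypothetical rows under p1 == 1 and p2 == 1, both sorted by (p3, key).
index_p1_p3 = [(1, "k1"), (1, "k4"), (2, "k2"), (3, "k7")]
index_p2_p3 = [(1, "k4"), (2, "k2"), (2, "k5"), (3, "k7")]
matches = zigzag_merge(index_p1_p3, index_p2_p3)

# A single-property index on p2 would be sorted by key only.
keys_in_p3_order = [key for _, key in index_p1_p3]
same_order_as_key_sort = keys_in_p3_order == sorted(keys_in_p3_order)
```

The jumps are only valid because both lists share one ordering. Here same_order_as_key_sort is False: the keys of the (p3, key) scan do not come out in key order, so a scan sorted by key alone could not be zigzagged against it.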

Related

Conditionally creating vertex in Titan

I have a situation where I need to check whether a vertex with three specific properties (property1='a', property2='b', property3='c') already exists in a graph, and create it if it does not. Basically, there should be a unique vertex in the graph for each combination of these three properties. I have tried this snippet of Gremlin code, which checks based on a single property, 'id':
getOrCreate = { id ->
    def p = g.V('userId', id)
    p.hasNext() ? p.next() : g.addVertex([userId:id])
}
I'm not sure of the best way to modify this to achieve what I need with Gremlin, since I'm a beginner. All I can think of is nesting more ifs and elses in the last statement. Any help is appreciated, thank you.
There are several approaches. One way would be to extend your traversal a bit:
getOrCreate = { one, two, three ->
    def p = g.V('prop1', one).has('prop2',two).has('prop3',three)
    p.hasNext() ? p.next() : g.addVertex([prop1:one,prop2:two,prop3:three])
}
In the above code, prop1 is an indexed property, and the remaining properties are simply filtered on. prop1 should be the most selective of the three, i.e. the one that filters out the most results.
If prop1 is not selective enough, this solution might not be fast enough. In other words, if you have 1 billion vertices and g.V('prop1', one) returns 100,000 of them, you will be filtering those in memory, which will be slow. In that case, I would consider creating a "poor man's" composite index: add a fourth, indexed property that combines all three properties into one, then do your lookups on that.
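The "poor man's" composite index can be sketched as follows. This is plain Python rather than Gremlin, with a dict standing in for the graph's index on the fourth property. The names composite_key, TinyGraph and combined are made up for illustration, and the separator is an assumption: it has to be a character that never occurs inside a property value.

```python
SEPARATOR = "\x1f"  # assumption: never appears inside a property value

def composite_key(one, two, three):
    """Combine three property values into one indexable string."""
    parts = [str(one), str(two), str(three)]
    if any(SEPARATOR in part for part in parts):
        raise ValueError("property value contains the separator")
    return SEPARATOR.join(parts)

class TinyGraph:
    """Stand-in for a graph with a single indexed property, 'combined'."""

    def __init__(self):
        self.by_combined = {}  # plays the role of the index on 'combined'

    def get_or_create(self, one, two, three):
        key = composite_key(one, two, three)
        vertex = self.by_combined.get(key)
        if vertex is None:
            vertex = {"prop1": one, "prop2": two, "prop3": three,
                      "combined": key}
            self.by_combined[key] = vertex
        return vertex

graph = TinyGraph()
first = graph.get_or_create("a", "b", "c")
second = graph.get_or_create("a", "b", "c")   # found, not created again
other = graph.get_or_create("ab", "", "c")    # a different combination
```

The separator matters: without it, ("a", "b", "c") and ("ab", "", "c") would both collapse to "abc" and be treated as the same vertex.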
You're almost there.
getOrCreate = { p1, p2, p3 ->
    def p = g.V().has('property1', p1).has('property2', p2).has('property3', p3)
    p.hasNext() ? p.next() : g.addVertex(['property1':p1,'property2':p2,'property3':p3])
}
Cheers,
Daniel

Projection query with new fields/properties ignores entries that haven't set those properties yet

I have an Article type structured like this:
type Article struct {
	Title   string
	Content string `datastore:",noindex"`
}
In an administrative portion of my site, I list all of my Articles. The only property I need in order to display this list is Title; grabbing the content of the article seems wasteful. So I use a projection query:
q := datastore.NewQuery("Article").Project("Title")
Everything works as expected so far. Now I decide I'd like to add two fields to Article so that some articles can be unlisted in the public article list and/or unviewable when access is attempted. Understanding the datastore to be schema-less, I think this might be very simple. I add the two new fields to Article:
type Article struct {
	Title      string
	Content    string `datastore:",noindex"`
	Unlisted   bool
	Unviewable bool
}
I also add them to the projection query, since I want to indicate in the administrative article list when an article is publicly unlisted and/or unviewable:
q := datastore.NewQuery("Article").Project("Title", "Unlisted", "Unviewable")
Unfortunately, this only returns entries that have explicitly included Unlisted and Unviewable when Put into the datastore.
My workaround for now is to simply stop using a projection query:
q := datastore.NewQuery("Article")
All entries are returned, and the entries that never set Unlisted or Unviewable have them set to their zero value as expected. The downside is that the article content is being passed around needlessly.
In this case, that compromise isn't terrible, but I expect similar situations to arise in the future, and it could be a big deal not being able to use projection queries. Projection queries and adding new properties to datastore entries don't seem to fit together well. I want to make sure I'm not misunderstanding something or missing the correct way to do things.
It's not clear to me from the documentation that projection queries should behave this way (ignoring entries that don't have the projected properties rather than including them with zero values). Is this the intended behavior?
Are the only options in scenarios like this (adding new fields to structs / properties to entries) to either forgo projection queries or run some kind of "schema migration", Getting all entries and then Putting them back, so they now have zero-valued properties and can be projected?
Projection queries take the values of the projected fields from the indexes, not from the entities. When you add new properties, pre-existing records do not appear in the indexes that the projection query reads, so they need to be re-indexed.
You are asking for those specific properties and, for the older records, they don't exist; hence the current behaviour.
You should probably think of a projection query as a request for entities that have a value in the requested index, in addition to any filter you place on the query.
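A toy model may help here. The sketch below is plain Python, not the Datastore API; build_index and projection_query are invented names, and the rule that an index row exists only for entities carrying every indexed property is the behaviour described in the answer above, simplified.

```python
def build_index(entities, props):
    """One index row per entity that has a value for every listed property."""
    rows = []
    for key, entity in entities.items():
        if all(p in entity for p in props):
            rows.append((tuple(entity[p] for p in props), key))
    return sorted(rows)

def projection_query(entities, props):
    """Answer the query from index rows only, never from the entities."""
    return [dict(zip(props, values))
            for values, _key in build_index(entities, props)]

articles = {
    "old": {"Title": "Written before the change", "Content": "body"},
    "new": {"Title": "Written after the change", "Content": "body",
            "Unlisted": False, "Unviewable": True},
}

titles_only = projection_query(articles, ["Title"])
with_flags = projection_query(articles, ["Title", "Unlisted", "Unviewable"])

# "Schema migration": re-save every entity with explicit zero values.
migrated = {key: {"Unlisted": False, "Unviewable": False, **entity}
            for key, entity in articles.items()}
after_migration = projection_query(
    migrated, ["Title", "Unlisted", "Unviewable"])
```

In this model titles_only has two results and with_flags only one, because the "old" article has no row in the three-property index. After the re-save, after_migration returns both articles.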

efficient set operations in google app engine datastore

I'm having problems converting a set problem into an efficient Google App Engine datastore solution. The problem is as follows. I have an entity defining a relationship between two objects, i.e. something like this:
type Relation struct {
	Obj1 int
	Obj2 int
	// other data
}
Now I want to perform the following query in an efficient manner: given a set of objects set = [obj1, obj2, obj3, obj4], I want to find all Relation entities (E) for which E.Obj1 ∈ set ∧ E.Obj2 ∈ set. Note that I do not know the set beforehand, so I cannot precompute all the entries in the set once. Is there any way to represent this problem in the datastore so that I can efficiently retrieve all the relationships that are part of a given set?
The equivalent GQL query is "SELECT * FROM Kind WHERE Obj1 IN :1 AND Obj2 IN :1", passing in the set as the first parameter. Unfortunately, IN queries expand out to one query for each term, so there's a combinatorial explosion of queries here - 16 queries in the case of a 4 element set. There's not really any way to avoid this with a standard query.
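The combinatorial expansion can be illustrated with a short Python sketch. It only enumerates the equality sub-queries that two IN filters fan out to; expand_in_filters is a hypothetical helper, not a datastore function.

```python
from itertools import product

def expand_in_filters(filters):
    """Expand IN filters into one equality sub-query per value combination.

    `filters` maps a property name to its list of allowed values.
    """
    names = list(filters)
    value_lists = [filters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

objects = [1, 2, 3, 4]
subqueries = expand_in_filters({"Obj1": objects, "Obj2": objects})
subquery_count = len(subqueries)  # 4 * 4 = 16
```

Each dict in subqueries stands for one query of the form Obj1 = x AND Obj2 = y, so the count grows with the square of the set size.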

GAE: Best way to add (math) all values of an integer property from multiple instances

I have about 400+ model instances of a certain model ("Grade") in my datastore. They all have an integer property called "points" (points = db.IntegerProperty(default=0)) with different values.
What is the best way to get a cumulative sum of all the "points" values from each instance? Is there a way to do it without having to retrieve all the instances with Model.all()?
You could first do a GQL query to get the keys of all entities whose points value is greater than 0:
SELECT __key__ FROM Grade WHERE points > 0
This way you only pull up the entities you need and ignore any that do not affect the sum. Then you can loop over the keys, retrieve each entity with db.get(key), and add its points to a running total (db.get also accepts a list of keys, which fetches them in one batch).
In GAE, queries that only get keys are more efficient and cost less: http://code.google.com/appengine/docs/python/datastore/queries.html#Queries_on_Keys

About indexes of GAE datastore

I have the following model in my GAE app.
from google.appengine.ext import db

class User(db.Model):
    school_name = db.StringProperty(indexed=True)
    country = db.StringProperty(indexed=True)
    city = db.StringProperty(indexed=True)
    sex = db.StringProperty(indexed=True)
    profession = db.StringProperty(indexed=True)
    joined_date = db.DateTimeProperty(indexed=True)
I want to filter users by combinations of these fields, and the results should list the most recently joined users first. That means every query ends with an order operation, I suppose, like this:
User.all().filter('country =','US').filter('profession =','SE').order('-joined_date')
User.all().filter('school_name =','AAA').filter('profession =','SE').order('-joined_date')
....
User.all().filter('sex =','Female').filter('profession =','HR').order('-joined_date')
There are C(5,1)+C(5,2)+...+C(5,5) = 31 possible combinations of these fields.
My question is: to implement this, do I need to create indexes for all 31 cases in Google AppEngine? Or can you suggest another way to implement it?
Note: C(n,k) is combination formula, see more on http://en.wikipedia.org/wiki/Combination
Thanks in advance!
You have several options:
1. Create all 31 indexes, as you suggest.
2. Do the sorting in memory. Without a sort order, all your queries can be executed with the built-in merge-join strategy, and so you won't need any indexes at all.
3. Restrict queries to those that are more likely, or those that eliminate most of the non-matching results, and perform additional filtering in memory.
4. Put all your data in a ListProperty for indexing as "key:value" strings, and filter only on that. You will need to create multiple indexes with different occurrence counts on that field (e.g. indexing it once, twice, etc.), and it will result in the same number of index entries, but fewer custom indexes used.
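As a rough illustration of the "key:value" strings combined with in-memory sorting, here is a plain-Python sketch. It does not touch the datastore: users are ordinary dicts, and to_tags, find_users, make_user and the sample data are all hypothetical. The first lines also confirm the count of 31 field combinations from the question.

```python
from datetime import datetime
from math import comb

FIELDS = ["school_name", "country", "city", "sex", "profession"]

# Non-empty field combinations: C(5,1) + C(5,2) + ... + C(5,5).
combination_count = sum(comb(len(FIELDS), k)
                        for k in range(1, len(FIELDS) + 1))

def to_tags(user):
    """Encode every filterable field as a "key:value" string."""
    return ["%s:%s" % (field, user[field]) for field in FIELDS]

def make_user(name, joined, **fields):
    user = dict(fields, name=name, joined_date=joined)
    user["tags"] = to_tags(user)  # would be the list property to filter on
    return user

def find_users(users, **criteria):
    """Match on tags only, then sort newest first in memory."""
    wanted = {"%s:%s" % (field, value) for field, value in criteria.items()}
    hits = [u for u in users if wanted.issubset(u["tags"])]
    return sorted(hits, key=lambda u: u["joined_date"], reverse=True)

users = [
    make_user("ann", datetime(2012, 1, 5), school_name="AAA", country="US",
              city="NYC", sex="Female", profession="SE"),
    make_user("bob", datetime(2012, 3, 1), school_name="BBB", country="US",
              city="LA", sex="Male", profession="SE"),
    make_user("cat", datetime(2012, 2, 1), school_name="AAA", country="US",
              city="NYC", sex="Female", profession="HR"),
]

recent_us_engineers = [u["name"] for u in
                       find_users(users, country="US", profession="SE")]
```

Because every filter becomes an equality test on the same list of strings, any subset of the five fields can be queried the same way; the price is that the ordering by joined_date happens after the fetch.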
