Efficient set operations in Google App Engine datastore


I'm having problems converting a set problem into an efficient google app engine datastore solution. The problem is as follows. I have an entity defining a relationship between two objects, i.e. something like this:
type Relation struct {
    Obj1 int
    Obj2 int
    // other data
}
Now I want to perform the following query in an efficient manner: given a set of objects set = [obj1, obj2, obj3, obj4], I want to find all Relation entities (E) for which E.Obj1 ∈ set ∧ E.Obj2 ∈ set. Note that I do not know the set beforehand, so I cannot precompute all the entries in the set once. Is there any way to represent this problem in the datastore so that I can efficiently retrieve all the relationships that are part of a given set?

The equivalent GQL query is "SELECT * FROM Kind WHERE Obj1 IN :1 AND Obj2 IN :1", passing in the set as the first parameter. Unfortunately, IN queries expand out to one query for each term, so there's a combinatorial explosion of queries here - 16 queries in the case of a 4 element set. There's not really any way to avoid this with a standard query.
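To make the expansion concrete, here is a small Python sketch that enumerates the equality sub-queries the two IN filters would fan out to. It does not talk to the datastore; expand_in_filters is a hypothetical helper used only for illustration.

```python
from itertools import product


def expand_in_filters(values):
    """Return every (Obj1, Obj2) equality pair that two IN filters over `values` fan out to."""
    return list(product(values, repeat=2))


# A 4-element set produces 4 * 4 equality combinations.
subqueries = expand_in_filters([1, 2, 3, 4])
print(len(subqueries))  # 16
```

The count grows with the square of the set size, which is why this approach stops being practical quickly.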

Related

How to get the count of property in a Kind?

I have a Kind Students which stores the details of favorite colors of all students. They are allowed to pick their favorite color from a set of three colors : {Red,Blue,Green}
Let us assume there are 100 students, my code is like this for every student :
Entity arya = new Entity("Student","Arya");
arya.setProperty("Color","Red");
Entity robb = new Entity("Student","Robb");
robb.setProperty("Color","Green");
..
..
Entity jon = new Entity("Student","Jon");
jon.setProperty("Color","Blue");
How can I find out how many students like a particular color (say Red) in this Student kind? What query should I write to fetch the count?
Thanks in advance

The number you seek would be the number of items in the result of a query with an equality filter on the Color property.
You could use a keys-only query (a special kind of projection query) for this purpose, faster and less expensive:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
...
A keys-only query is a small operation and counts as only a single
entity read for the query itself.
Something along these lines (but note that I'm not a Java user; the snippet is based only on the documentation examples):
Query<Key> query = Query.newKeyQueryBuilder()
    .setKind("Student")
    .setFilter(PropertyFilter.eq("Color", "Red"))
    .build();

I agree with Dan Cornilescu's answer. Here is a direct Datastore API usage. I have prepared the request body for your use case; you can run it by just adding your project ID. It returns the entities that match the filter, and you can then count them.

Query given keys

I would like to accomplish some sort of hybrid solution between ndb.get_multi() and Query().
I have a set of keys, that I can use with:
entities = ndb.get_multi(keys)
I would like to query, filter, and order these entities using Query() or some more efficient way than doing all myself in the Python code manually.
How do people go about doing this? I want something like this:
query = Entity.gql('WHERE __key__ in :1 AND prop1 = :2 ORDER BY prop2', keys, 'hello')
entities = query.fetch()
Edit:
The above code works just fine, but it seems like fetch() never uses values from cache, whereas ndb.get_multi() does. Am I correct about this? If not, is the gql+fetch method much worse than get_multi+manual processing?

There is no way to run a query over already-fetched entities unless you write it yourself, but all of this can easily be done with built-in Python filtering and sorting. Note that it's more efficient to run a query if you have a big dataset, rather than calling get_multi on hundreds of keys to get only 5 entities.
entities = ndb.get_multi(keys)
# filtering
entities = [e for e in entities if e.prop1 == 'bla' and e.prop2 > 3]
#sorting by multiple properties
entities = sorted(entities, key=lambda x: (x.prop1, x.prop2))
UPDATE: And yes, cache is only used when you receive your entity by key, it is not used when you query for entities.
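One detail worth handling when post-processing get_multi results is that ndb.get_multi returns None for keys that do not exist. The sketch below shows the same filter-then-sort idea with that case covered. It runs on plain in-memory stand-ins, so the Entity namedtuple and the filter_and_sort helper are assumptions for illustration, not part of ndb.

```python
from collections import namedtuple

# Stand-in for an ndb model instance; a real app would use an ndb.Model subclass.
Entity = namedtuple("Entity", ["prop1", "prop2"])


def filter_and_sort(entities, prop1_value):
    """Drop missing results, keep entities whose prop1 matches, and order them by prop2."""
    found = [e for e in entities if e is not None]
    matches = [e for e in found if e.prop1 == prop1_value]
    return sorted(matches, key=lambda e: e.prop2)


# Simulated get_multi output: the None stands for a key with no entity.
fetched = [Entity("hello", 3), None, Entity("bye", 1), Entity("hello", 2)]
result = filter_and_sort(fetched, "hello")
```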

Projection query with new fields/properties ignores entries that haven't set those properties yet

I have an Article type structured like this:
type Article struct {
Title string
Content string `datastore:",noindex"`
}
In an administrative portion of my site, I list all of my Articles. The only property I need in order to display this list is Title; grabbing the content of the article seems wasteful. So I use a projection query:
q := datastore.NewQuery("Article").Project("Title")
Everything works as expected so far. Now I decide I'd like to add two fields to Article so that some articles can be unlisted in the public article list and/or unviewable when access is attempted. Understanding the datastore to be schema-less, I think this might be very simple. I add the two new fields to Article:
type Article struct {
Title string
Content string `datastore:",noindex"`
Unlisted bool
Unviewable bool
}
I also add them to the projection query, since I want to indicate in the administrative article list when an article is publicly unlisted and/or unviewable:
q := datastore.NewQuery("Article").Project("Title", "Unlisted", "Unviewable")
Unfortunately, this only returns entries that have explicitly included Unlisted and Unviewable when Put into the datastore.
My workaround for now is to simply stop using a projection query:
q := datastore.NewQuery("Article")
All entries are returned, and the entries that never set Unlisted or Unviewable have them set to their zero value as expected. The downside is that the article content is being passed around needlessly.
In this case, that compromise isn't terrible, but I expect similar situations to arise in the future, and not being able to use projection queries could be a big deal. Projection queries and adding new properties to datastore entries don't seem to fit together well. I want to make sure I'm not misunderstanding something or missing the correct way to do things.
It's not clear to me from the documentation that projection queries should behave this way (ignoring entries that don't have the projected properties rather than including them with zero values). Is this the intended behavior?
Are the only options in scenarios like this (adding new fields to structs / properties to entries) to either forgo projection queries or run some kind of "schema migration", Getting all entries and then Putting them back, so they now have zero-valued properties and can be projected?

Projection queries source the data for fields from the indexes, not from the entity. When you add new properties, pre-existing records do not appear in the indexes that the projection query runs against. They will need to be re-indexed.
You are asking for those specific properties and they don't exist, hence the current behaviour.
You should probably think of a projection query as a request for entities with a value in a requested index in addition to any filter you place on a query.
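A toy model may help picture this. The sketch below is not how the datastore is implemented; it only mimics the observable behaviour described above, where an entity is returned by a projection only if it has a value for every projected property. The project function and the sample dictionaries are assumptions for illustration.

```python
def project(entities, *props):
    """Mimic a projection: return only the projected properties, skipping entities that lack any of them."""
    rows = []
    for entity in entities:
        if all(p in entity for p in props):
            rows.append({p: entity[p] for p in props})
    return rows


articles = [
    # Written before the new fields existed, so it has no Unlisted/Unviewable values.
    {"Title": "Old post"},
    # Written after the struct was extended.
    {"Title": "New post", "Unlisted": False, "Unviewable": True},
]

titles_only = project(articles, "Title")
with_flags = project(articles, "Title", "Unlisted", "Unviewable")
```

Here titles_only contains both articles, while with_flags contains only the one that was stored with the new properties.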

Does the NDB membership query ("IN" operation) performance degrade with lots of possible values?

The documentation for the IN query operation states that those queries are implemented as a big OR'ed equality query:
qry = Article.query(Article.tags.IN(['python', 'ruby', 'php']))
is equivalent to:
qry = Article.query(ndb.OR(Article.tags == 'python',
Article.tags == 'ruby',
Article.tags == 'php'))
I am currently modelling some entities for a GAE project and plan on using these membership queries with a lot of possible values:
qry = Player.query(Player.facebook_id.IN(list_of_facebook_ids))
where list_of_facebook_ids could have thousands of items.
Will this type of query perform well with thousands of possible values in the list? If not, what would be the recommended approach for modelling this?

This won't work with thousands of values (in fact I bet it starts degrading with more than 10 values). The only alternative I can think of is some form of precomputation. You'll have to change your schema.

One way you can do it is to create a new model called FacebookPlayer, which acts as an index. It would be keyed by facebook_id, and you would update it whenever you add a new player. It looks something like this:
class FacebookPlayer(ndb.Model):
    player = ndb.KeyProperty(kind='Player', required=True)
Now you can avoid queries altogether. You can do this:
# Build keys from facebook ids.
facebook_id_keys = []
for facebook_id in list_of_facebook_ids:
    facebook_id_keys.append(ndb.Key('FacebookPlayer', facebook_id))
# Fetch the index entities; get_multi returns None for ids with no match.
keysOfUsersMatchedByFacebookId = []
for facebook_player in ndb.get_multi(facebook_id_keys):
    if facebook_player:
        keysOfUsersMatchedByFacebookId.append(facebook_player.player)
usersMatchedByFacebookId = ndb.get_multi(keysOfUsersMatchedByFacebookId)
If list_of_facebook_ids is thousands of items, you should do this in batches.
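A minimal sketch of the batching step, in plain Python. The chunked helper is hypothetical, and the batch size of 1000 is an arbitrary assumption rather than a documented limit; tune it for your application.

```python
def chunked(items, size):
    """Yield successive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example input: 2500 made-up ids.
list_of_facebook_ids = [str(n) for n in range(2500)]
batches = list(chunked(list_of_facebook_ids, 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each batch would then be turned into keys and passed to ndb.get_multi as in the code above.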

In AppEngine (JDO), what is the difference between equality (==) of an item with list and contains() function?

For example, if I have: List A; and a String B;
What is the difference, in JDO (AppEngine), between the following two conditions in a query: B == A; and A.contains(B);?
Also, does the query in Slides 23-25 of http://dl.google.com/io/2009/pres/W_0415_Building_Scalable_Complex_App_Engines.pdf work efficiently in AppEngine (JDO) for more than 30 receivers? How so, especially since I read in AppEngine documentation that each contains() query can have a maximum of 30 items in the list. Do I not use a contains() query to imitate the above slides (written in Python)? If not, then how can I achieve the same results in JDO?
Any suggestions/comments are highly welcome. I'm trying to build a messaging system in AppEngine but having trouble trying to get used to the platform.
Thanks.

There's no difference - in App Engine, equality checks on lists are the same as checking for containment, due to the way things are indexed in the datastore.
By the query on slides 23-25, I presume you mean this one?
indexes = db.GqlQuery(
"SELECT __key__ FROM MessageIndex "
"WHERE receivers = :1", me)
keys = [k.parent() for k in indexes]
messages = db.get(keys)
This works just fine, as it's a list containment check as described above, and results in a single datastore query. The limitation you're thinking about is on the reverse: if you have a list, and you want to find a record that contains any item in that list, a subquery will be created for each element in the list.
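The difference between the two directions can be pictured with a toy index in plain Python. This is only an illustration of the idea that a list property is indexed once per element; build_index and the sample data are assumptions, not datastore code.

```python
from collections import defaultdict


def build_index(entities):
    """Map each individual list element to the set of entity keys whose list contains it."""
    index = defaultdict(set)
    for key, receivers in entities.items():
        for receiver in receivers:
            index[receiver].add(key)
    return index


message_indexes = {
    "msg1": ["alice", "bob"],
    "msg2": ["bob"],
    "msg3": ["carol"],
}
index = build_index(message_indexes)

# "receivers = 'bob'": a single lookup, however long each receivers list is.
for_bob = index["bob"]

# "receivers IN ['alice', 'carol']": one lookup per value in the list.
for_any = index["alice"] | index["carol"]
```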
