About indexes of GAE datastore - database

I have a following model in the GAE app.
class User
school_name = db.StringProperty(Indexed=True)
country = db.StringProperty(Indexed=True)
city = db.StringProperty(Indexed=True)
sex = db.StringProperty(Indexed=True)
profession = db.StringProperty(Indexed=True)
joined_date = db.DateTimeProperty(Indexed=True)
And I want to filter the users by combinations of these fields. Result of the filter should show a user at first who is joined recently. So which means any query end by order operation, I suppose. like that:
User.all().filter('country =','US').filter('profession =','SE').order('-joined_date')
User.all().filter('school_name =','AAA').filter('profession =','SE').order('-joined_date')
....
User.all().filter('sex =','Female').filter('profession =','HR').order('-joined_date')
All these fields combination would be C(5,1)+C(5,2)+...+C(5,5) = 31.
My question is to implement it, do I need to create indexes for all these cases(31) in the Google AppEngine. Or can you suggest other way to implement it?
Note: C(n,k) is combination formula, see more on http://en.wikipedia.org/wiki/Combination
Thanks in advance!

You have several options:
Create all 31 indexes, as you suggest.
Do the sorting in memory. Without a sort order, all your queries can be executed with the built-in merge-join strategy, and so you won't need any indexes at all.
Restrict queries to those that are more likely, or those that eliminate most of the non-matching results, and perform additional filtering in memory.
Put all your data in a ListProperty for indexing as "key:value" strings, and filter only on that. You will need to create multiple indexes with different occurrence counts on that field (eg, indexing it once, twice, etc), and it will result in the same number of index entries, but fewer custom indexes used.

Related

How to get the count of property in a Kind?

I have a Kind Students which stores the details of favorite colors of all students. They are allowed to pick their favorite color from a set of three colors : {Red,Blue,Green}
Let us assume there are 100 students, my code is like this for every student :
Entity arya = new Entity("Student","Arya");
arya.setProperty("Color","Red");
Entity robb = new Entity("Student","Robb");
robb.setProperty("Color","Green");
..
..
Entity jon = new Entity("Student","Jon");
jon.setProperty("Color","Blue");
How to find out how many students liked a particular color(say Red) in this Student Kind ? What Query I should write to fetch the count ?
Thanks in advance
The number you seek would be the number of items in the result of a query with an equality filter on the Color property.
You could use a keys-only query (a special kind of projection query) for this purpose, faster and less expensive:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
...
A keys-only query is a small operation and counts as only a single
entity read for the query itself.
Something along these lines (but note that I'm not a java user, the snippet is based only on the documentation examples)
Query<Key> query = Query.newKeyQueryBuilder()
.setKind("Student")
.setFilter(PropertyFilter.eq("Color", "Red")
.build();
I agree with the Dan Cornilescu's answer. Here is a direct Datastore API usage. I have prepared the request body for your use-case. You can run it by just adding your Project Id. This will return the entities that matches with the filter then you can count the number of them.

Azure Search: Order by dynamic data

I have an Azure Search index composed of documents that can "occur" in multiple regions any number of times. For example Document1 has 5 occurrences in Region1, 20 occurrences in Region2. Document2 has 54 occurrences in Region1, and 10 occurrences in Region3. Document3 has 10 occurrences in Region3. We want to use Azure Search for searching and suggestions, but base the order on number of occurrences on a region. For example the search for Document from a user in Region1 should return in the order Document2, Document1, Document3 because Document2 has 54 occurrences in that region, while Document1 has 5 occurrences and Document3 has none.
[
{ 'name': 'Document1', 'regions': ['Region1|5', 'Region2|20'] },
{ 'name': 'Document2', 'regions': ['Region1|54', 'Region3|10'] },
{ 'name': 'Document3', 'regions': ['Region3|10'] }
]
I'm having a hard time figuring out how to structure the index or if it is even possible with Azure Search. Please note that the number of regions is potentially in the hundreds of thousands. I am ok with changing regions for center points and use geospatial functions instead, but I still don't see how to lay the data or query it.
What is the best way to structure the index and how would one make the query possible?
tl;dr - There might be a solution for you based on some assumptions I have. Please read on, and if possible try to provide some validations around my assumptions for me to give a better answer (if such an answer exists).
Unfortunately, Azure search doesn't have an out-of-the box approach for your scenario. There might be a work around however - instead of the regions collection being something like ['Region1|5', 'Region2|20'], you could try to structure the document such that it appears to be ['Region1', 'Region1',...., 'Region2', 'Region2', ...] (that is, make the collection contain n elements of Region1 and m elements of Region2 where in your case n = 5 and m = 10.
Then you should simply be able to search using the Region that the user originates from and I believe the results should be ordered based on which document's collection column (regions) contains more occurrences of the particular queried region.
This approach helps you in 2 ways:
You could try adding each region as a column in the search index and use some queries to get the kind of result you want. However, since you mention there might be hundreds of thousands of such regions, it might not work well with our service limits. If however that's not the case, I highly recommend adding each region as a column, so that you can query/order by the column value.
With the replication of the string approach, you can have arbitrarily large collections as I believe Azure search does not have any limitations with regard to the number of elements in a collection. Also the nice thing here is, if your document will have a sparse number of regions (i.e., you may have 100s of 1000s of regions, but any given document will only have few regions enumerated), you should be able to achieve what you want. If that's not the case however, this approach might not be super nice/efficient and might even be painful for you to manage.
Also, just FYI I'd recommend taking a look at the scoring profiles feature
and especially the tag function to see if that might in any way be useful to you.

Is it possible to re-order query results in memory?

and thanks in advance for any and all help!!
I'm running a query on the datastore that looks like this:
forks = Thing.query(ancestor=user.subscriber_key).filter(
Thing.status==True,
Thing.fork_of==thing_key,
Thing.start_date <= user.day_threshold(),
Thing.level.IN([1,2,3,4,5])).order(
Thing.level)
This query works and returns the results I expect. However, I would like to sort it on one additional field (Thing.last_touched). If I add this to the sort, it won't work because Thing.last_touched is not the property to which the inequality filter is applied. I can't add an additional inequality filter, since we're only allowed one, plus it's not needed (actually, that's why Thing.leve.IN is there.. not needed as a filter, but required for the sort).
So, what I'm wondering is, could I run the query with the filters that I want, and then run code to sort the query results myself? I know I could pull all the parameters I want to sort and store them in dictionaries and sort them that way, but it seems to me there ought to be a way to handle this with the query.
I've searched for days for this but have had no luck.
Just in case you need it, here's the class definition of Thing:
class Thing(ndb.Model):
title = ndb.StringProperty()
level = ndb.IntegerProperty()
fork = ndb.BooleanProperty()
recursion_level = ndb.IntegerProperty()
fork_of = ndb.KeyProperty()
creation_date = ndb.DateTimeProperty(auto_now_add=True)
last_touched = ndb.DateTimeProperty(auto_now=True)
status = ndb.BooleanProperty()
description = ndb.StringProperty()
owner_id = ndb.StringProperty()
frequency = ndb.IntegerProperty()
start_date = ndb.DateTimeProperty(auto_now_add=True)
due_date = ndb.DateTimeProperty()
One of the main reasons that Google AppEngine is so fast even when dealing with insane amounts of data is because of the very limited query options. All standard queries are "scans" over an index, i.e. there is some table (index) that keeps references to your actual data entires in order sorted by ONE of the data's properties. So, let's say you add the following entries:
Thing A: start-date = Wednesday (I'm just going to use weekdays for simplicity)
Thing B: start-date = Friday
Thing C: start-date = Monday
Thing D: start-date = Thursday
Then, AppEngine will create an index that looks like this:
1 - Monday -> Thing C
2 - Wednesday -> Thing A
3 - Thursday -> Thing D
4 - Friday -> Thing B
Now, any query will correspond to a continuous block in this (or another) index. If you, for example, say "All Things with start-date >= Tuesday", it will return entries in row 2 through 4 (i.e. Thing A, Thing D, and Thing B in that exact order!). If you query for "< Thursday", you get 1-2. If you say "> Tuesday and <= Thursday" you get 2-3.
And if you are doing inequality filters on a different property, AppEngine will use a different index.
This is why you can only do one inequality filter and why the sort-order is always also specified by the property that you do an inequality filter of. Because AppEngine is not designed to be able to return items 1, 2, 4 (with a gap*) out of an index, or items 4, 2, 3 (no gap, but out of order).
So, if you need to sort your entries on a different property other than the one you use for inequality filtering, you basically have 3 choices:
Perform your query with the inequality filter, read all results into memory, and sort them in your code afterwards (I think this is what you mean by storing them in a dictionary)
Perform your query WITHOUT the inequality filter, but sorted on the right property. Then, as you loop over the returned entries, simply check the inequality yourself and drop the ones that don't match
Perform your query with the inequality filter and just return the items in the wrong order, and let the client-application worry about sorting them! ;)
Generally I would assume that you have much more unused resources available client-side to do the sorting, so I would probably go for option 3 in most cases. But if you need to sort the entries server-side (e.g. for a mobile-app targeted at older smart-phones), it will depend on the size of your database and the fraction of entries that usually match your inequality filter, whether option 1 or option 2 are better. If your inequality filter only removes a small fraction of the entries, option 2 might be much faster (as it doesn't require any O(>n) sorting), but if you have a huge database of entries and only a very small number of them will match the inequality, definitely go for option 1.
BTW: The talk "App Engine Datastore Under the Covers" from Google I/O 2008 might be a very helpful resource. It's a bit technical, but it gives a great overview of this topic and I consider it must-know information if you want to do anything in AppEngine. Note, though, that this talk is a bit out-dated. There are a bunch more things that you can do with queries now-a-days. But ALL of these extra things (if I understand correctly) are API functions that in the end just generate a set of several simple queries (exactly like the ones described in this talk) and then just combine the results of these in memory in your application (just like you would if you did your own sorting).
*There are some exceptions where AppEngine can generate the intersection of two (or more?) index-scans to drop items from the results, but I don't think that you could use that to change the order of the returned entries.

Django Query Optimisation

I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Datastore fetch on two filters alternative?

I have a datastore entity called Game and two fields in it called playerOne and playerTwo. Either of these fields stores a username.
I need to search on the Game entity and return a MAX of 30 games where the username can be either playerOne OR playerTwo...
So in a relational database you would go:
SELECT * FROM Game WHERE playerOne='username' OR playerTwo='username' LIMIT 30
But in big table you can't filter on more than one field! I can't fetch 10 from one and 10 from the other as the number from each can be variable and in createdDate order.
How would you do this in your datastore?
The quick answer is create a StringListProperty that contains [player_a, player_b] and then simply use the multi-value index made out of that:
games = Game.all().filter("players =", player_find)
You can not do an OR query on the datastore using different fields. If you have to keep your current entity model then you have to do two queries.
1) filtering on playerOne and limiting to 30
2) filtering on playerTwo and limiting to (30 - result size of query one)
Then merge the results in memory to produce the final set of 30.
Now if you also want some ordering by date, then it will get more tricky. However the SQL query you wrote doesn't have any ordering so I omitted it aswell.
However if you can change the entity model then a good way to achive what you want is to have a single field containing a list of both usernames.
Then you can do a simple query in the style of:
SELECT * FROM Game WHERE playerBoth = 'username'

Resources