Django Queryset Construction and Update - django-models

I have a model with 2 foreign keys to the same model.
class Team(models.Model):
name= models.CharField()
class Game(models.Model):
team_1= models.ForeignKey(Team)
team_2= models.ForeignKey(Team)
date = models.DateTimeField()
team_1_post_game_rating = models.DecimalField()
team_2_post_game_rating = models.DecimalField()
If there is a result, a calculation is done updating the rating based on the result and the rankings of both teams. No problems so far, except if a result is edited after other games have been played.
What i need to be able to do (In the most efficient way possible) is find all the games and update the rankings for all teams that were played by either team subsequent to the game that was edited and any team that those teams played and so on.
I could probably do it using a sub query and values list and iterating on the results but of course Id rather find a nice clean way to construct it dynamically and get away without my database going into meltdown.

Related

Google Datastore - Search Optimization Technique

I am dealing with a real-estate app. A Home will hvae typical properties like Price, Bed Rooms, Bath Rooms, SqFt, Lot size etc. User will search for Homes and such a query will require multiple inequality filters like: Price between x and y, rooms greater than z, bathrooms more than p... etc.
I know that multiple inequality filters are not allowed. I also do not want to perform any filtering in my code and/because I want to be able to use Cursors.
so I have come up with two solutions. I am not sure if these are right - so wonder if gurus can shed some light
Solution 1: I will discretize the values of each attribute and save them in a list-field, then use IN. For example: If there are 3 bed rooms, instead of storing beds=3, I will store beds = [1,2,3]. Now if a user searches for homes with say at least two bedrooms, then instead of writing the filter as beds>2, I will write the filter as "beds IN [2]" - and my home above [1,2,3] will qualify - so so will any home with 2 beds [1,2] or 4 beds [1,2,3,4] and so on
Solution 2: It is similar to the first one but instead of creating a list-property, I will actually add attributed (columns) to the home. So a home with 3 bed rooms will have the following attributed/columns/properties: col-bed-1:true, col-bed-2:true, col-bed-3:true. Now if a user searches for homes with say at least two bedrooms, then instead of writing the filter as beds>2, I will write the filter as "col-bed-2 = true" - and my home will qualify - so will any home with 2 beds, 3 beds, 4 beds and so on
I know both solutions will work, but I want to know:
1. Which one is better both from a performance and google pricing perspective
2. Is there a better solution to do this?
I do almost exactly your use case with a python gae app that lists posts with housing advertisements (similar to craigslist). I wrote it in python and searching with a filter is working and straightforward.
You should choose a language: Python, Java or Go, and then use the Google Search API (that has built-in filtering for equalities or inequalities) and build datastore indexes that you can query using the search API.
For instance, you can use a python class like the following to populate the datastore and then use the Search API.
class Home(db.Model):
address = db.StringProperty(verbose_name='address')
number_of_rooms = db.IntegerProperty()
size = db.FloatProperty()
added = db.DateTimeProperty(verbose_name='added', auto_now_add=True) # readonly
last_modified = db.DateTimeProperty(required=True, auto_now=True)
timestamp = db.DateTimeProperty(auto_now=True) #
image_url = db.URLProperty();
I definitely think that you should avoid storing permutations for several reasons: Permutations can explode in size and makes the code difficult to read. Instead you should do like I did and find examples where someone else has already solved an equal or similar problem.
This appengine demo might help you.

Tracking item order for storage to and retrieval from a DB

I'm trying to figure out how I'm going to 'CRUD' the order of items I have in a group that I'm storing in a database. (Pseudo statement of: select * items from app where group_id = 1;)
My guess is that I just use an numeric field and just increase/decrease the number as more items are added to/removed from the group. I can then just update the items number in this field as they are moved around. However, I've seen this go really badly wrong in an old legacy app where items would get out of sync and you'd have a group where the order ended up something like this:
0,1,1,3,4,5
0,1,1,1,4,5
This wasn't handled very gracefully by the application either, and broke the application necessitating manual intervention to reorder the items in the DB.
Is there a way to avoid this pitfall?
EDIT: I would also maybe want the items available in multiple groups with multiple orders.
I think in that case I would need a many to many relationship for both the group to item relationship and the item to order relationship. /EDIT
I'll be doing this in the Django framework.
I'm not really sure what you are asking; because ordering is one thing, and grouping of related objects is something else entirely.
Databases don't store the order of things, but rather the relationships (grouping) of things. The order of things is a user interface detail and not something that a database should be used for.
In django, you can create a ManyToMany relationship. This essentially creates a "box" where you can add and remove items that are related to a particular model. Here is the example from the documentation:
from django.db import models
class Publication(models.Model):
title = models.CharField(max_length=30)
# On Python 3: def __str__(self):
def __unicode__(self):
return self.title
class Meta:
ordering = ('title',)
class Article(models.Model):
headline = models.CharField(max_length=100)
publications = models.ManyToManyField(Publication)
# On Python 3: def __str__(self):
def __unicode__(self):
return self.headline
class Meta:
ordering = ('headline',)
Here an Article can belong to many Publications, and Publications have one or more Articles associated with them:
a = Article.create(headline='Hello')
b = Article.create(headline='World')
p = Publication.create(title='My Publication')
p.article_set.add(a)
p.article_set.add(b)
p.save()
# You can also add an article to a publication from the article object:
c = Article.create(headline='The Answer is 42')
c.publications.add(p)
To know how many articles belong to a publication:
Publication.objects.get(title='My Publication').article_set.count()

Django Query Optimisation

I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
title = db.StringProperty()
message = db.StringProperty()
created_on = db.DateTimeProperty()
created_by = db.ReferenceProperty(User)
category = db.StringProperty()
And there are 100000000 entities made of this model.
I want to count entities which category equals 'book'.
In this case, should I create the following mode for counting them?
class Category(db.Model):
category = db.StringProperty()
look_message = db.ReferenceProperty(Message)
Does this small model make it faster to count?
And does it erase smaller memory?
I'm thinking to count them like the following by the way
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100000000 entities is a very expensive operation on a NoSQL database as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce operation to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat().all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think that you have to create another entity like this.
This entity will just count the number of messages by category.
Just change your category to this:
class Category(db.model):
category = db.StringProperty()
totalOfMessages = db.IntegerProperty(default=0)
In the message class you change to reference the category class, just change the category property to:
category = db.ReferenceProperty(Category)
When you create a new Message object, you have to update the counter, increment when you create a new message or decrement if you delete.
The best way to work with counters on GAE is using Sharding Counters
Count is implemented as an index scan that discards all data except the number of records seen . It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

How to order by the field stored in the separate model?

Here is simplified version of my datastore structure:
class News(db.Model):
title = db.StringProperty()
class NewsRating(db.Model):
user = db.IntegerProperty()
rating = db.IntegerProperty()
news = db.ReferenceProperty(News)
Now I need to display all news sorted by their total rating (sum of different users ratings). How can I do that in the following code:
news = News.all()
# filter by additional parms
# news.filter("city =", "1")
news.order("-added") # ?
for one_news in news:
self.response.out.write(one_news.title()+'<br>')
Queries only have access to the entity you're querying against, if you have a property from another entity (or some aggregate calculation based on fields from other entities) that you want to use to order results, you're going to need to store it in the entity you're querying against.
In the case of ratings, that might mean a periodic task that sums up ratings and distributes them to articles.
To do that you would need to run a query fetching every single NewsRating referencing your News entity and sum all the ratings (as the datastore does not provide JOINs). This will be a huge task both time and cost wise. I'd recommend to take a look at just-overheard-it example as a reference point.

Resources