I have a model:
class MyModel(db.Model):
ts = db.DateTimeProperty(auto_now_add=True)
id_from_other_source = db.StringProperty(default='')
#some data
Now I have a list of some ids which match id_from_other_source field.
Data about id changes in time, so for one id there is a lot of data.
I'd like to run such a query that fetches me for each id only one entry of that id that is the youngest.
Something like:
MyModel.all().filter('id_from_other_source IN', my_id_list).order('-ts').fetch(1000)
But with disctinction in id_from_other_source. I understand that you can't run GQL queries with DISTINCT, but maybe you can see any solution that won't run too much queries?
One solution is to take them one by one and fetch the result, but I'd really like to do it with lesser number of queries.
Related
I have an object Contract that has a look-up to another object Indexationtype. I have another object IndexationEntry that has master-detail to Indexationtype. Now I would like to get the value of the percentage field in the IndexationEntry onto Contract based on the yer fields. The year in the IndexationEntry matches Year in Contract. How should I achieve this?
From Contract "up" to IndexationType__c, then "down" to IndexationEntry__c?
If there's no direct link between them it's not going to be pretty. One way would be something like this
SELECT Id, Name,
(SELECT Id, ContractNumber FROM Contracts__r WHERE Year__c = '2021'),
(SELECT Id, Percent__c FROM IndexationEntries__r WHERE Year__c = '2021')
FROM IndexationType__c
You'd have to run it once for each year. Or (since you tagged it Apex) maybe you can prepare the reference data a bit, query Indexation Types + Entries and build something like Map<Id, Map<Integer, IndexationEntry__c>> (1st key is by Indexation Type Id, then by year). Query them, populate the Map, then loop through your contracts and use map.get() to fetch your values.
I want to know how efficient this filter can be done with django queries. Essentially I have the followig two clases
class Act(models.Model):
Date = models.DateTimeField()
Doc = models.ForeignKey(Doc)
...
class Doc(models.Model):
...
so one Doc can have severals Acts, and for each Doc I want to get the act with the greater Date. I'm only interested in Acts objects.
For example, if a have
act1 = (Date=2021-01-01, Doc=doc1)
act2 = (Date=2021-01-02, Doc=doc1)
act3 = (Date=2021-01-03, Doc=doc2)
act4 = (Date=2021-01-04, Doc=doc2)
act5 = (Date=2021-01-05, Doc=doc2)
I want to get [act2, act5] (the Act with Doc=doc1 with the greater Date and the Act with Doc=doc2 with the greater Date).
My only solution is to make a for over Docs.
Thank you so much
You can do this with one or two queries: the first query will retrieve the latest Act per Doc, and then the second one will then retrieve the acts:
from django.db.models import OuterRef, Subquery
last_acts = Doc.objects.annotate(
latest_act=Subquery(
Act.objects.filter(
Doc_id=OuterRef('pk')
).values('pk').order_by('-Date')[:1]
)
).values('latest_act')
and then we can retrieve the corresponding Acts:
Act.objects.filter(pk__in=last_acts)
depending on the database, it might be more efficient to first retrieve the primary keys, and then make an extra query:
Act.objects.filter(pk__in=list(last_acts))
I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.
Here is simplified version of my datastore structure:
class News(db.Model):
title = db.StringProperty()
class NewsRating(db.Model):
user = db.IntegerProperty()
rating = db.IntegerProperty()
news = db.ReferenceProperty(News)
Now I need to display all news sorted by their total rating (sum of different users ratings). How can I do that in the following code:
news = News.all()
# filter by additional parms
# news.filter("city =", "1")
news.order("-added") # ?
for one_news in news:
self.response.out.write(one_news.title()+'<br>')
Queries only have access to the entity you're querying against, if you have a property from another entity (or some aggregate calculation based on fields from other entities) that you want to use to order results, you're going to need to store it in the entity you're querying against.
In the case of ratings, that might mean a periodic task that sums up ratings and distributes them to articles.
To do that you would need to run a query fetching every single NewsRating referencing your News entity and sum all the ratings (as the datastore does not provide JOINs). This will be a huge task both time and cost wise. I'd recommend to take a look at just-overheard-it example as a reference point.
I have a datastore entity called Game and two fields in it called playerOne and playerTwo. Either of these fields stores a username.
I need to search on the Game entity and return a MAX of 30 games where the username can be either playerOne OR playerTwo...
So in a relational database you would go:
SELECT * FROM Game WHERE playerOne='username' OR playerTwo='username' LIMIT 30
But in big table you can't filter on more than one field! I can't fetch 10 from one and 10 from the other as the number from each can be variable and in createdDate order.
How would you do this in your datastore?
The quick answer is create a StringListProperty that contains [player_a, player_b] and then simply use the multi-value index made out of that:
games = Game.all().filter("players =", player_find)
You can not do an OR query on the datastore using different fields. If you have to keep your current entity model then you have to do two queries.
1) filtering on playerOne and limiting to 30
2) filtering on playerTwo and limiting to (30 - result size of query one)
Then merge the results in memory to produce the final set of 30.
Now if you also want some ordering by date, then it will get more tricky. However the SQL query you wrote doesn't have any ordering so I omitted it aswell.
However if you can change the entity model then a good way to achive what you want is to have a single field containing a list of both usernames.
Then you can do a simple query in the style of:
SELECT * FROM Game WHERE playerBoth = 'username'