Performance issue with django exclude - sql-server

I have a Django 1.8 application, and I am using an MsSQL database, with pyodbc as the db backend (using "django-pyodbc-azure" module).
I have the following models:
class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So, using the exclude at the database level doesn't seem to affect the query performance, but in Django the query runs 6 times slower if I add the exclude part. What could be causing this?

The general issue seems to be that django is doing some extra work to prepare the exclude clause. After that step and by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet then django might be executing the badDevices query just to prepare the actual query's SQL. Possibly something similar might be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is of course django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can be at least eliminated.
EDIT: I think it must be the Device.uid field. Possibly django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
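The raw where clause above builds the id list by string interpolation. A variation that lets the driver do the quoting is to generate one placeholder per id and pass the ids as parameters. Here is a minimal sketch of that idea, using the stdlib sqlite3 module as a stand-in for the real database; the table contents and ids are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in data; the real table and ids come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (id INTEGER PRIMARY KEY, device_id TEXT)")
conn.executemany(
    "INSERT INTO session (device_id) VALUES (?)",
    [("id-1",), ("id-2",), ("id-3",), ("id-4",)],
)

bad_devices = ["id-1", "id-2"]

# One placeholder per value; the driver handles quoting and escaping.
placeholders = ", ".join("?" for _ in bad_devices)
where_clause = "device_id NOT IN (%s)" % placeholders

remaining = conn.execute(
    "SELECT COUNT(*) FROM session WHERE " + where_clause, bad_devices
).fetchone()[0]
print(remaining)
```

With Django's extra(), the same approach should work by using %s placeholders in the where string and passing the list through the params argument, though I have not tried it against django-pyodbc-azure.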

Just want to share a similar problem that I had with MySQL and exclude clauses performance and how it was fixed.
When running the exclude clause, the list passed to the "in" lookup was actually a QuerySet that I got using the values_list method. Checking the exclude query executed by MySQL, the "in" objects were not values but another query. This behavior was hurting performance on specific large queries.
To fix that, instead of passing the queryset, I flattened it into a Python list of values. That way each value is passed as an argument inside the in lookup, and performance improved considerably.

Related

increase performance of a linq query using contains

I have a winforms app where I have a Telerik dropdownchecklist that lets the user select a group of state names.
Using EF and the database is stored in Azure SQL.
The code then hits a database of about 17,000 records and filters the results to only include states that are checked.
Works fine. I am wanting to update a count on the screen whenever they change the list box.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried to load everything into a memory variable then querying that vs the database but can't seem to figure out how to do that.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down by:
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop.Stop_state).ToList();
And better yet --
int count = (from stop in aDb.Stop_address_details where
states.Contains(stop.Stop_state)
select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of your query actually has nothing to do with Contains in this case. Contains is pretty performant. The problem, as you picked up on in your third solution, is that you are pulling far more data over the network than required.
In your first solution you are pulling back all of the rows from the server with the matching stop state and performing the count locally. This is the worst possible approach. You are pulling back data just to count it and you are pulling back far more data than you need.
In your second solution you limited the data coming back to a single field which is why the performance improved. This could have resulted in a significant improvement if your table is really wide. The problem with this is that you are still pulling back all the data just to count it locally.
In your third solution EF will translate the .Count() method into a query that performs the count for you. So the count will happen on the server and the only data returned is a single value; the result of count. Since network latency CAN often be (but is not always) the longest step when performing a query, returning less data can often result in significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (@p0)
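The difference between counting locally and counting on the server is not specific to EF. As a rough illustration, here is a minimal Python sketch using the stdlib sqlite3 module; the stops table and its rows are hypothetical stand-ins for Stop_address_details.

```python
import sqlite3

# Hypothetical stand-in for the Stop_address_details table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO stops (state) VALUES (?)",
    [("TX",), ("TX",), ("CA",), ("NY",), ("CA",)],
)

states = ["TX", "CA"]
marks = ", ".join("?" for _ in states)

# First approach: fetch every matching row, then count on the client.
rows = conn.execute(
    "SELECT * FROM stops WHERE state IN (%s)" % marks, states
).fetchall()
local_count = len(rows)

# Third approach: the database counts and returns a single value.
server_count = conn.execute(
    "SELECT COUNT(*) FROM stops WHERE state IN (%s)" % marks, states
).fetchone()[0]

print(local_count, server_count)
```

Both give the same number, but the first transfers every matching row while the second transfers one integer.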

Entity Framework: Max. number of "subqueries"?

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name),
                   TaskDates = x.Tasks.Select(y => y.Date),
                   DateInfos = x.Dates.Select(y => y.Info)
               }).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the max. number of these "subqueries" for a single statement? Is there a way to work around this issue without having to make more than one call to the database?
(ps: I'm not entirely sure, but I seem to remember that this query worked in LINQ2SQL - could it be?)
UPDATE
I'm getting crazy about this. I tried to repro the issue from ground up using a fresh, simple project (to post the entire piece of code here, not only an oversimplified example) - and I found I wasn't able to repro it. It still happens within our existing code base (apparently there's more behind this problem, but I cannot share this closed code base, unfortunately).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independent of the above) when I don't use a .Take() to only get the first X rows
It also works great when I add an additional .Where() statement to cut the number of rows returned from SQL Server
I didn't find any comprehensible reason why I see this behavior, but I started to look at the SQL: Although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows on a specific index in the middle of the execution plan, which curiously ends in less rows returned for the entire SQL statement - which in turn results in the missing data, that is the reason for my question to begin with.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (just don't use READ UNCOMMITTED), but I have no idea why it occurs and if it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least: As far as I can see, it's not an issue with EF itself.
A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name).ToList(),
                   TaskDates = x.Tasks.Select(y => y.Date).ToList(),
                   DateInfos = x.Dates.Select(y => y.Info).ToList()
               }).ToList();

Django Query: Annotate with `count` of a *window*

I search for a query which is pretty similar to this one. But as an extension, I do not want to count all objects, but just over the ones, that are fairly recent.
In my case, there are two models. Let one be the Source and one be the Data. As result I'd like to get a list of all Sources ordered by the number of data records, that has been collected during the last week.
I am not interested in how many data records have been collected in total, only in whether the source has been active recently.
Using the following code snippet from the above link, I cannot figure out how to restrict the Data table first.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
.order_by('-count_data_records')
The only ways I came up with, would be to write native SQL or to process this in a loop and individual queries. Is there a Django-Query version?
(I use a MySQL database and Django 1.5.4)
Check out the docs on the order of annotate and filter: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
activity_per_source = Source.objects.\
filter(data__date__gte=one_week_ago).\
annotate(count_data_records=Count('Data')).\
order_by('-count_data_records').distinct()
There is a way of doing that mixing Django queries with SQL via extra:
start_date = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = (
Source.objects
.extra(where=["(select max(date) from app_data where source_id=app_source.id) >= '%s'"
% start_date.strftime('%Y-%m-%d')])
.annotate(count_data_records=Count('Data'))
.order_by('-count_data_records'))
The where part will filter the Sources by its Data last date.
Note: replace table and field names with actual ones.
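The cutoff for "the last week" has to be computed with a timedelta, since a date and a plain integer cannot be subtracted. A minimal sketch, using a fixed, made-up date in place of datetime.date.today() so the result is reproducible:

```python
import datetime

# Fixed hypothetical date standing in for datetime.date.today().
today = datetime.date(2024, 1, 10)

# Subtract a timedelta, not a bare integer.
start_date = today - datetime.timedelta(days=7)

# Same format string as in the where clause above.
cutoff = start_date.strftime('%Y-%m-%d')
print(cutoff)
```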

Django Query Optimisation

I am currently working on a telecom analytics project and am new to query optimisation. Showing the result in the browser takes a full minute, even though only 45,000 records are accessed. Could you please suggest ways to reduce the time it takes to show results?
I wrote the following query to find the average call duration for people in an age group:
sigma = 0
popn = len(Demo.objects.filter(age_group=age))
card_list = [Demo.objects.filter(age_group=age)[i].card_no
             for i in range(popn)]
for card in card_list:
    dic = Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma += dic['duration__sum']
avgDur = sigma / popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
    card_no = models.CharField(max_length=20, primary_key=True)
    gender = models.IntegerField()
    age = models.IntegerField()
    age_group = models.IntegerField()

class Fact_table(models.Model):
    pri_key = models.BigIntegerField(primary_key=True)
    card_no = models.CharField(max_length=20)
    duration = models.IntegerField()
    time_8bit = models.CharField(max_length=8)
    time_of_day = models.IntegerField()
    isBusinessHr = models.IntegerField()
    Day_of_week = models.IntegerField()
    Day = models.IntegerField()
Thanks
Try this:
sigma = 0
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # One
card_list = demo_by_age.values_list('card_no', flat=True)  # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # Three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes the models easier to use, but also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name argument allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print("Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg']))
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.
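To make the "single query" point concrete, here is a minimal sketch of the kind of JOIN plus GROUP BY statement such an annotate call boils down to. It uses the stdlib sqlite3 module with made-up table names and rows; the SQL Django actually generates will differ in naming and detail.

```python
import sqlite3

# Hypothetical stand-ins for the Demo and Fact_table tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demo (id INTEGER PRIMARY KEY, card_no TEXT UNIQUE, age_group INTEGER);
CREATE TABLE fact (id INTEGER PRIMARY KEY,
                   card_id INTEGER REFERENCES demo(id),
                   duration INTEGER);
INSERT INTO demo (id, card_no, age_group) VALUES (1, 'a', 1), (2, 'b', 1), (3, 'c', 2);
INSERT INTO fact (card_id, duration) VALUES (1, 10), (1, 20), (2, 30), (3, 40);
""")

# One statement: join calls to people, then average per age group.
averages = dict(conn.execute("""
    SELECT demo.age_group, AVG(fact.duration)
    FROM demo
    LEFT OUTER JOIN fact ON fact.card_id = demo.id
    GROUP BY demo.age_group
"""))
print(averages)
```

Note that this averages over calls. The code in the question divides the total duration by the number of people (sigma/popn), which is a different figure if people make different numbers of calls.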

Django hits the database for each filter() call

I have some Django 1.3 code that looks up many model instances in a loop, ie.
my_set = myinstance.subitem_set.all()
for value in values:
    existing = my_set.filter(attr_name=value)
    if len(existing) == 1:
        ...
This works, but profiling SQL queries shows that it hits the DB on each iteration. According to https://docs.djangoproject.com/en/1.3/ref/models/querysets/ iterating over the related items should eagerly load them, so I tried calling:
list(my_set)
However, this doesn't help. It does do a query to load all the sub-items, but then it still does an individual query for each sub-item inside the loop. How do I get it to use the cached set and not hit the DB each time? The DB is PostgreSQL 8.4.
The problem is in this line:
if len(existing) == 1:
From Django documentation:
len(). A QuerySet is evaluated when you call len() on it. This, as you might expect, returns the length of the result list.
Note: Don't use len() on QuerySets if all you want to do is determine the number of records in the set. It's much more efficient to handle a count at the database level, using SQL's SELECT COUNT(*), and Django provides a count() method for precisely this reason. See count() below.
So in your case it executes the query each time you call len(existing). A more efficient way is:
existing.count() == 1
This will also hit the database each time you call it but it will execute SELECT COUNT(*) which is faster.
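If the goal is to avoid the per-iteration queries entirely, note that each filter() call returns a new QuerySet, which does not reuse rows already loaded by list(my_set). One option is to load the items once and group them in Python. A minimal sketch, assuming the whole set fits in memory; SubItem and its values are hypothetical stand-ins for the real model instances.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for model instances; in Django this would be list(my_set).
SubItem = namedtuple("SubItem", ["pk", "attr_name"])
loaded = [SubItem(1, "red"), SubItem(2, "blue"), SubItem(3, "blue")]

# Group the loaded items by attribute value once, in memory.
by_attr = defaultdict(list)
for item in loaded:
    by_attr[item.attr_name].append(item)

values = ["red", "blue", "green"]
unique_matches = {}
for value in values:
    existing = by_attr.get(value, [])  # no database access here
    if len(existing) == 1:
        unique_matches[value] = existing[0]

print(unique_matches)
```

This trades one larger query for many small ones, so it only pays off when the related set is reasonably small.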
