Django Query: Annotate with `count` of a *window* - database

I search for a query which is pretty similar to this one. But as an extension, I do not want to count all objects, but just over the ones, that are fairly recent.
In my case, there are two models. Let one be the Source and one be the Data. As result I'd like to get a list of all Sources ordered by the number of data records, that has been collected during the last week.
For me it is not iteresting, how many data records have been collected in total, but if there is a recent activity of that source.
Using the following code snippet from the above link, I cannot make up how to subquery the Data Table before.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
.order_by('-count_data_records')
The only ways I came up with, would be to write native SQL or to process this in a loop and individual queries. Is there a Django-Query version?
(I use a MySQL database and Django 1.5.4)

Checkout out the docs on the order of annotate and filter: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
activity_per_source = Source.objects.\
filter(data__date__gte=one_week_ago).\
annotate(count_data_records=Count('Data')).\
order_by('-count_data_records').distinct()

There is a way of doing that mixing Django queries with SQL via extra:
start_date = datetime.date.today() - 7
activity_per_source = (
Source.objects
.extra(where=["(select max(date) from app_data where source_id=app_source.id) >= '%s'"
% start_date.strftime('%Y-%m-%d')])
.annotate(count_data_records=Count('Data'))
.order_by('-count_data_records'))
The where part will filter the Sources by its Data last date.
Note: replace table and field names with actual ones.

Related

Groupby and count() with alias and 'normal' dataframe: python pandas versus mssql

Coming from a SQL environment, I am learning some things in Python Pandas. I have a question regarding grouping and aggregates.
Say I group a dataset by Age Category and count the different categories. In MSSQL I would write this:
SELECT AgeCategory, COUNT(*) AS Cnt
FROM TableA
GROUP BY AgeCategory
ORDER BY 1
The result set is a 'normal' table with two columns, the second column I named Count.
When I want to do the equivalent in Pandas, the groupby object is different in format. So now I have to reset the index and rename the column in a following line. My code would look like this:
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index()
grouped.columns = ['AgeCategory', 'Count']
grouped
My question is if this can be accomplished in one go. Seems like I am over-doing it, but I lack experience.
Thanks for any advise.
Regards, M.
Use parameter name in DataFrame.reset_index:
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index(name='Count')
Or:
grouped = df.groupby('AgeCategory').size().reset_index(name='Count')
Difference is GroupBy.count exclude missing values, GroupBy.size not.
More information about aggregation in pandas.

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MsSQL database, with pyodbc as the db backend (using "django-pyodbc-azure" module).
I have the following models:
class Branch(models.Model):
name = models.CharField(max_length=30)
startTime = models.DateTimeField()
class Device(models.Model):
uid = models.CharField(max_length=100, primary_key=True)
type = models.CharField(max_length=20)
firstSeen = models.DateTimeField()
lastSeen = models.DateTimeField()
class Session(models.Model):
device = models.ForeignKey(Device)
branch = models.ForeignKey(Branch)
start = models.DateTimeField()
end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = Session.objects.filter(branch=branch)
.exclude(device__in=badDevices)
.filter(end__gte=F('start')+timedelta(minutes=30)).count()
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 miliseconds.
I printed the generated sql for this queryset, and tried it in my database client. There, both versions executed in around 250 miliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So, using the exclude in database level doesn't seem to be affecting the query performance, but in django, the query runs 6 times slower if I add the exclude part. What could be causing this?
The general issue seems to be that django is doing some extra work to prepare the exclude clause. After that step and by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet then django might be executing the badDevices query just to prepare the actual query's SQL. Possibly something similar might be happening in the case where device has a non-default primary key.
The other thing might delay the SQL preparation is of course django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can be at least eliminated.
EDIT: I think it must be the Device.uid field. Possibly django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
Just want to share a similar problem that I had with MySQL and exclude clauses performance and how it was fixed.
When running the exclude clause, the list with the "in" lookup was actually a Queryset that I got using values_list method. Checking the exclude query executed by MySQL, the "in" objects were not values but actually another query. This behavior was impacting performance on specific large queries.
To fix that, instead of passing the queryset, I flat it out in a python list of values. By doing that, each value is passed as an argument inside the in lookup and the performance was really improved.

Determining Difference Between Items On-Hand and Items Required per Project in Access 2003

I'm usually a PHP programmer, but I'm currently working on a project in MS Access 2003 and I'm a complete VBA newbie. I'm trying to do something that I could easily do in PHP but I have no idea how to do it in Access. The facts are as follows:
Tables and relevant fields:
tblItems: item_id, on_hand
tblProjects: project_id
tblProjectItems: project_id, item_id
Goal: Determine which projects I could potentially do, given the items on-hand.
I need to find a way to compare each project's required items against the items on-hand to determine if there are any items missing. If not, add the project to the list of potential projects. In PHP I would compare an array of on-hand items with an array of project items required, using the array_diff function; if no difference, add project_id to an array of potential projects.
For example, if...
$arrItemsOnHand = 1,3,4,5,6,8,10,11,15
$arrProjects[1] = 1,10
$arrProjects[2] = 8,9,12
$arrProjects[3] = 7,13
$arrProjects[4] = 1,3
$arrProjects[5] = 2,14
$arrProjects[6] = 2,5,8,10,11,15
$arrProjects[7] = 2,4,5,6,8,10,11,15
...the result should be:
$arrPotentialProjects = 1,4
Is there any way to do this in Access?
Consider a single query to reach your goal: "Determine which projects I could potentially do, given the items on-hand."
SELECT
pi.project_id,
Count(pi.item_id) AS NumberOfItems,
Sum(IIf(i.on_hand='yes', 1, 0)) AS NumberOnHand
FROM
tblProjectItems AS pi
INNER JOIN tblItems AS i
ON pi.item_id = i.item_id
GROUP BY pi.project_id
HAVING Count(pi.item_id) = Sum(IIf(i.on_hand='yes', 1, 0));
That query computes the number of required items for each project and the number of those items which are on hand.
When those two numbers don't match, that means at least one of the required items for that project is not on hand.
So the HAVING clause excludes those rows from the query result set, leaving only rows where the two numbers match --- those are the projects for which all required items are on hand.
I realize my description was not great. (Sorry.) I think it should make more sense if you run the query both with and without the HAVING clause ... and then read the description again.
Anyhow, if that query gives you what you need, I don't think you need VBA array handling for this. And if you can use that query as your form's RecordSource or as the RowSource for a list or combo box, you may not need VBA at all.

Django Query Optimisation

I am working currently on telecom analytics project and newbie in query optimisation. To show result in browser it takes a full minute while just 45,000 records are to be accessed. Could you please suggest on ways to reduce time for showing results.
I wrote following query to find call-duration of a person of age-group:
sigma=0
popn=len(Demo.objects.filter(age_group=age))
card_list=[Demo.objects.filter(age_group=age)[i].card_no
for i in range(popn)]
for card in card_list:
dic=Fact_table.objects.filter(card_no=card.aggregate(Sum('duration'))
sigma+=dic['duration__sum']
avgDur=sigma/popn
Above code is within for loop to iterate over age-groups.
Model is as follows:
class Demo(models.Model):
card_no=models.CharField(max_length=20,primary_key=True)
gender=models.IntegerField()
age=models.IntegerField()
age_group=models.IntegerField()
class Fact_table(models.Model):
pri_key=models.BigIntegerField(primary_key=True)
card_no=models.CharField(max_length=20)
duration=models.IntegerField()
time_8bit=models.CharField(max_length=8)
time_of_day=models.IntegerField()
isBusinessHr=models.IntegerField()
Day_of_week=models.IntegerField()
Day=models.IntegerField()
Thanks
Try that:
sigma=0
demo_by_age = Demo.objects.filter(age_group=age);
popn=demo_by_age.count() #One
card_list = demo_by_age.values_list('card_no', flat=True) # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration') #Three
sigma = dic['duration__sum']
avgDur=sigma/popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn seperate queries and database hits. The query in the for-loop will also hit the database popn times. As a general rule, you should try to minimize the amount of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
card_no = models.SlugField(max_length=20, unique=True)
...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for you example:
class Fact_table(models.Model):
card = models.ForeignKey(Demo, related_name='facts')
...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
print "Age group: %s - Average duration: %s" % group['age_group'], group['duration_avg']
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Adding a projection to an NHibernate criteria stops it from performing default entity selection

I'm writing an NHibernate criteria that selects data supporting paging. I'm using the COUNT(*) OVER() expression from SQL Server 2005(+) to get hold of the total number of available rows, as suggested by Ayende Rahien. I need that number to be able to calculate how many pages there are in total. The beauty of this solution is that I don't need to execute a second query to get hold of the row count.
However, I can't seem to manage to write a working criteria (Ayende only provides an HQL query).
Here's an SQL query that shows what I want and it works just fine. Note that I intentionally left out the actual paging logic to focus on the problem:
SELECT Items.*, COUNT(*) OVER() AS rowcount
FROM Items
Here's the HQL:
select
item, rowcount()
from
Item item
Note that the rowcount() function is registered in a custom NHibernate dialect and resolves to COUNT(*) OVER() in SQL.
A requirement is that the query is expressed using a criteria. Unfortunately, I don't know how to get it right:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(
Projections.SqlFunction("rowcount", NHibernateUtil.Int32));
Whenever I add a projection, NHibernate doesn't select item (like it would without a projection), just the rowcount() while I really need both. Also, I can't seem to project item as a whole, only it's properties and I really don't want to list all of them.
I hope someone has a solution to this. Thanks anyway.
I think it is not possible in Criteria, it has some limits.
You could get the id and load items in a subsequent query:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(Projections.ProjectionList()
.Add(Projections.SqlFunction("rowcount", NHibernateUtil.Int32))
.Add(Projections.Id()));
If you don't like it, use HQL, you can set the maximal number of results there too:
IList<Item> result = Session
.CreateQuery("select item, rowcount() from item where ..." )
.SetMaxResult(100)
.List<Item>();
Use CreateMultiCriteria.
You can execute 2 simple statements with only one hit to the DB that way.
I am wondering why using Criteria is a requirement. Can't you use session.CreateSQLQuery? If you really must do it in one query, I would have suggested pulling back the Item objects and the count, like:
select {item.*}, count(*) over()
from Item {item}
...this way you can get back Item objects from your query, along with the count. If you experience a problem with Hibernate's caching, you can also configure the query spaces (entity/table caches) associated with a native query so that stale query cache entries will be cleared automatically.
If I understand your question properly, I have a solution. I struggled quite a bit with this same problem.
Let me quickly describe the problem I had, to make sure we're on the same page. My problem came down to paging. I want to display 10 records in the UI, but I also want to know the total number of records that matched the filter criteria. I wanted to accomplish this using the NH criteria API, but when adding a projection for row count, my query no longer worked, and I wouldn't get any results (I don't remember the specific error, but it sounds like what you're getting).
Here's my solution (copy & paste from my current production code). Note that "SessionError" is the name of the business entity I'm retrieving paged data for, according to 3 filter criterion: IsDev, IsRead, and IsResolved.
ICriteria crit = CurrentSession.CreateCriteria(typeof (SessionError))
.Add(Restrictions.Eq("WebApp", this));
if (isDev.HasValue)
crit.Add(Restrictions.Eq("IsDev", isDev.Value));
if (isRead.HasValue)
crit.Add(Restrictions.Eq("IsRead", isRead.Value));
if (isResolved.HasValue)
crit.Add(Restrictions.Eq("IsResolved", isResolved.Value));
// Order by most recent
crit.AddOrder(Order.Desc("DateCreated"));
// Copy the ICriteria query to get a row count as well
ICriteria critCount = CriteriaTransformer.Clone(crit)
.SetProjection(Projections.RowCountInt64());
critCount.Orders.Clear();
// NOW add the paging vars to the original query
crit = crit
.SetMaxResults(pageSize)
.SetFirstResult(pageNum_oneBased * pageSize);
// Set up a multi criteria to get your data in a single trip to the database
IMultiCriteria multCrit = CurrentSession.CreateMultiCriteria()
.Add(crit)
.Add(critCount);
// Get the results
IList results = multCrit.List();
List<SessionError> sessionErrors = new List<SessionError>();
foreach (SessionError sessErr in ((IList)results[0]))
sessionErrors.Add(sessErr);
numResults = (long)((IList)results[1])[0];
So I create my base criteria, with optional restrictions. Then I CLONE it, and add a row count projection to the CLONED criteria. Note that I clone it before I add the paging restrictions. Then I set up an IMultiCriteria to contain the original and cloned ICriteria objects, and use the IMultiCriteria to execute both of them. Now I have my paged data from the original ICriteria (and I only dragged the data I need across the wire), and also a raw count of how many actual records matched my criteria (useful for display or creating paging links, or whatever). This strategy has worked well for me. I hope this is helpful.
I would suggest investigating custom result transformer by calling SetResultTransformer() on your session.
Create a formula property in the class mapping:
<property name="TotalRecords" formula="count(*) over()" type="Int32" not-null="true"/>;
IList<...> result = criteria.SetFirstResult(skip).SetMaxResults(take).List<...>();
totalRecords = (result != null && result.Count > 0) ? result[0].TotalRecords : 0;
return result;

Resources