Entity Framework: Max. number of "subqueries"? - sql-server

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name),
                   TaskDates = x.Tasks.Select(y => y.Date),
                   DateInfos = x.Dates.Select(y => y.Info)
               }).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the max. number of these "subqueries" for a single statement? Is there a way to work around this issue without having to make more than one call to the database?
(ps: I'm not entirely sure, but I seem to remember that this query worked in LINQ2SQL - could it be?)
UPDATE
I'm going crazy over this. I tried to repro the issue from the ground up using a fresh, simple project (to post the entire piece of code here, not only an oversimplified example) - and I found I wasn't able to repro it. It still happens within our existing code base (apparently there's more behind this problem, but I cannot share this closed code base, unfortunately).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independent of the above) when I don't use a .Take() to only get the first X rows
It also works great when I add an additional .Where() clause to cut the number of rows returned from SQL Server
I didn't find any comprehensible reason why I see this behavior, but I started to look at the SQL: Although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows on a specific index in the middle of the execution plan, which curiously ends in fewer rows returned for the entire SQL statement - which in turn results in the missing data that prompted my question to begin with.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (just don't use READ UNCOMMITTED), but I have no idea why it occurs and if it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least: As far as I can see, it's not an issue with EF itself.

A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name).ToList(),
                   TaskDates = x.Tasks.Select(y => y.Date).ToList(),
                   DateInfos = x.Dates.Select(y => y.Info).ToList()
               }).ToList();

Related

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MsSQL database, with pyodbc as the db backend (using "django-pyodbc-azure" module).
I have the following models:
class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset, and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So, using the exclude in database level doesn't seem to be affecting the query performance, but in django, the query runs 6 times slower if I add the exclude part. What could be causing this?
The general issue seems to be that django is doing some extra work to prepare the exclude clause. After that step and by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet then django might be executing the badDevices query just to prepare the actual query's SQL. Possibly something similar might be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is of course django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can be at least eliminated.
EDIT: I think it must be the Device.uid field. Possibly django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
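As a hedged alternative to interpolating the ids into the SQL string, the same NOT IN clause can be built with one placeholder per value, so the driver handles the quoting. The sketch below uses the standard-library sqlite3 module as a stand-in database; the helper name not_in_clause and the sample table are made up for illustration. Django's extra() also accepts a params argument, so the same idea should carry over with %s placeholders, but that is not tested here.

```python
import sqlite3

def not_in_clause(column, values, placeholder="?"):
    """Return (sql_fragment, params) for `column NOT IN (...)` using placeholders."""
    if not values:
        # NOT IN () is invalid SQL; an always-true condition excludes nothing.
        return "1 = 1", []
    marks = ", ".join([placeholder] * len(values))
    return "%s NOT IN (%s)" % (column, marks), list(values)

# Stand-in data: an in-memory table with a few sessions (hypothetical values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (id INTEGER PRIMARY KEY, device_id TEXT)")
conn.executemany(
    "INSERT INTO session (device_id) VALUES (?)",
    [("id-1",), ("id-2",), ("id-3",), ("id-4",)],
)

bad_devices = ["id-1", "id-2"]
where, params = not_in_clause("device_id", bad_devices)
remaining = conn.execute(
    "SELECT COUNT(*) FROM session WHERE " + where, params
).fetchone()[0]
```

The column name is still concatenated into the SQL, so it should come from code, never from user input.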
Just want to share a similar problem that I had with MySQL and exclude clauses performance and how it was fixed.
When running the exclude clause, the list with the "in" lookup was actually a Queryset that I got using values_list method. Checking the exclude query executed by MySQL, the "in" objects were not values but actually another query. This behavior was impacting performance on specific large queries.
To fix that, instead of passing the queryset, I flattened it into a Python list of values. By doing that, each value is passed as an argument inside the "in" lookup, and the performance improved considerably.

increase performance of a linq query using contains

I have a winforms app where I have a Telerik dropdownchecklist that lets the user select a group of state names.
Using EF and the database is stored in Azure SQL.
The code then hits a database of about 17,000 records and filters the results to only include states that are checked.
Works fine. I am wanting to update a count on the screen whenever they change the list box.
This is the code, in the itemCheckChanged event:
var states = stateDropDownList.CheckedItems.Select(i => i.Value.ToString()).ToList();
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop).ToArray();
ExportInfo_tb.Text = "Current Stop Count: " + filteredStops.Count();
It works, but it is slow.
I tried to load everything into a memory variable then querying that vs the database but can't seem to figure out how to do that.
Any suggestions?
Improvement:
I picked up a noticeable improvement by limiting the amount of data coming down by:
var filteredStops = (from stop in aDb.Stop_address_details where states.Contains(stop.Stop_state) select stop.Stop_state).ToList();
And better yet --
int count = (from stop in aDb.Stop_address_details where
states.Contains(stop.Stop_state)
select stop).Count();
ExportInfo_tb.Text = "Current Stop Count: " + count.ToString();
The performance of your query actually has nothing to do with Contains in this case. Contains is pretty performant. The problem, as you picked up on in your third solution, is that you are pulling far more data over the network than required.
In your first solution you are pulling back all of the rows from the server with the matching stop state and performing the count locally. This is the worst possible approach. You are pulling back data just to count it and you are pulling back far more data than you need.
In your second solution you limited the data coming back to a single field which is why the performance improved. This could have resulted in a significant improvement if your table is really wide. The problem with this is that you are still pulling back all the data just to count it locally.
In your third solution EF will translate the .Count() method into a query that performs the count for you. So the count will happen on the server and the only data returned is a single value; the result of count. Since network latency CAN often be (but is not always) the longest step when performing a query, returning less data can often result in significant gains in query speed.
The query translation of your final solution should look something like this:
SELECT COUNT(*) AS [value]
FROM [Stop_address_details] AS [t0]
WHERE [t0].[Stop_state] IN (@p0)
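The difference between counting locally and counting on the server can be sketched outside EF as well. The snippet below is only an illustration using Python's standard-library sqlite3 module with a made-up stops table; it is not what EF emits, but it shows the same two shapes of query: one that returns every matching row and one that returns a single number.

```python
import sqlite3

# Stand-in table with a handful of rows (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (id INTEGER PRIMARY KEY, stop_state TEXT)")
conn.executemany(
    "INSERT INTO stops (stop_state) VALUES (?)",
    [("TX",), ("TX",), ("CA",), ("NY",), ("CA",)],
)

states = ["TX", "CA"]
marks = ", ".join("?" * len(states))  # one placeholder per selected state

# First approach: pull every matching row to the client, then count locally.
rows = conn.execute(
    "SELECT * FROM stops WHERE stop_state IN (%s)" % marks, states
).fetchall()
local_count = len(rows)

# Third approach: let the database count and return a single value.
server_count = conn.execute(
    "SELECT COUNT(*) FROM stops WHERE stop_state IN (%s)" % marks, states
).fetchone()[0]
```

Both counts agree, but the second query transfers one value regardless of how many rows match.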

App engine datastore inconsistent?

This is so weird...
First of all this query works in the datastore viewer, ie. it returns the correct row.
SELECT * FROM Level where short_id = 'Ec71eN'
But if I run this
Level.all().filter("short_id = ", 'Ec71eN').get()
it returns None, if I run this:
db.GqlQuery("SELECT * FROM Level where short_id = '%s'" % 'Ec71eN').get()
it also returns None. If I run this:
level = Level.get_by_id(189009)
it returns the correct row (189009 is the id for the correct row)
Puzzling? What can be wrong here? I have never seen anything like this before, it has worked correctly for at least a couple of weeks in production... I think I have at least two cases now where it doesn't work, starting today.
UPDATE: This cannot be an eventual consistency problem since the row was 7 hours old when I tried the above. I had two rows with the same symptoms, strangely both generated by the same users. They were both "fixed" after I did a manual fetch of their ids by uploading special-case code like:
if short_id == CASE_1_SHORT_ID:
    level = Level.get_by_id(CASE_1_ID)
After that the query worked as usual.
Are you using the HRD? Nothing's wrong. You know it's supposed to be eventually consistent right?
Query operations are eventually consistent.
Get-by-id operations are fully consistent.
What you describe is correct datastore behavior. It's a bit odd that the datastore viewer operation returns the correct result, but it might have hit a separate tablet on the datastore operation.
Given that it was created 7 hours ago, the 'eventual consistency' generally should take seconds to minutes.
If eventual consistency IS the problem, run the same query method a bunch of times and see if it returns the same result. If it continuously returns the same result with the same method, then it is more than likely not an eventual consistency problem. You should switch to the NDB API for querying data as well - it's 1000 times better and Guido worked on it - so you know it's good. Does NDB show the same inconsistency?
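A rough sketch of that "run it a bunch of times" check, in plain Python: the helper distinct_results is hypothetical, and the lambdas stand in for a real call such as Level.all().filter("short_id = ", 'Ec71eN').get(). In practice a short pause between attempts would probably make sense.

```python
def distinct_results(run_query, attempts=5):
    """Call run_query() several times and return the distinct results seen."""
    seen = []
    for _ in range(attempts):
        result = run_query()
        if result not in seen:
            seen.append(result)
    return seen

# Stand-in for a query that misses at first and then finds the entity.
answers = iter([None, None, "level-189009", "level-189009", "level-189009"])
flaky = distinct_results(lambda: next(answers))

# Stand-in for a query that is consistently empty.
stable = distinct_results(lambda: None)
```

More than one distinct result hints at replication lag; a single, stable wrong result points somewhere else, such as the index.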

Google App Engine - Delete until count() <= 0

What is the difference between these 2 pieces of code?
query = Location.all(keys_only=True)
while query.count() > 0:
    db.delete(query.fetch(5))
# --
while True:
    query = Location.all(keys_only=True)
    if not query.count():
        break
    db.delete(query.fetch(5))
They both work.
Logically, these two pieces of code perform the same exact thing - they delete every Location entity, 5 at a time.
The first piece of code is better both in terms of style and (slightly) in terms of performance. (The query itself does not need to be rebuilt in each loop).
However, this code is not as efficient as it could be. It has several problems:
You use count() but do not need to. It would be more efficient to simply fetch the entities, and then test the results to see if you got any.
You are making more round-trips to the datastore than you need to. Each count(), fetch(), and delete() call must go to the datastore and back. These round-trips are slow, so you should try to minimize them. You can do this by fetching more entities in each loop.
Example:
q = Location.all(keys_only=True)
results = q.fetch(500)
while results:
    db.delete(results)
    results = q.fetch(500)
Edit: Have a look at Nick's answer below - he explains why this code's performance can be improved even more by using query cursors.
Here's a solution that's neater, but you may or may not consider to be a hack:
q = Location.all(keys_only=True)
for batch in iter(lambda: q.fetch(500), []):
    db.delete(batch)
One gotcha, however, is that as you delete more and more, the backend is forced to skip over the 'tombstoned' entities to find the next ones that aren't deleted. Here's a more efficient solution that uses cursors:
q = Location.all(keys_only=True)
results = q.fetch(500)
while results:
    db.delete(results)
    q = Location.all(keys_only=True).with_cursor(q.cursor())
    results = q.fetch(500)
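The two-argument iter() form used in the first snippet of this answer may be unfamiliar: iter(callable, sentinel) keeps calling the callable until it returns a value equal to the sentinel. Here is a minimal sketch against an in-memory stand-in; the FakeStore class is invented for illustration and only mimics fetch/delete, it is not the datastore API.

```python
class FakeStore:
    """In-memory stand-in for the datastore (hypothetical, for illustration)."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.round_trips = 0

    def fetch(self, limit):
        # Return up to `limit` keys that have not been deleted yet.
        self.round_trips += 1
        return self.keys[:limit]

    def delete(self, batch):
        # Drop every key in the batch from the store.
        self.round_trips += 1
        self.keys = [k for k in self.keys if k not in batch]

store = FakeStore(range(12))
batches = []
# iter(callable, sentinel) keeps calling fetch until it returns [].
for batch in iter(lambda: store.fetch(5), []):
    batches.append(len(batch))
    store.delete(batch)
```

With 12 keys and a batch size of 5, the loop deletes three batches and makes one final fetch that returns the empty sentinel.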
In the second one, query will be assigned/updated in every loop. I don't know if this is needed with the logic behind it (I don't use google app engine). To replicate this behaviour, the first one would have to look like this:
query = Location.all(keys_only=True)
while query.count() > 0:
    db.delete(query.fetch(5))
    query = Location.all(keys_only=True)
In my opinion, the first style is way more readable than the second one.
It's too bad you can't do this in python.
query = Location.all(keys_only=True)
while locations = query.fetch(5):
    db.delete(locations)
Like in the other P language
while (@row = $sth->fetchrow_array) {
    do_something();
}
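For what it's worth, Python 3.8 and later do allow assignment inside the loop condition via the := operator (PEP 572). The old App Engine db API in this thread predates that, so the sketch below uses made-up fetch and delete functions over an in-memory list rather than a real query.

```python
pending = list(range(12))  # stand-in for entity keys (hypothetical data)
deleted_batches = []

def fetch(limit):
    """Return up to `limit` remaining keys without removing them."""
    return pending[:limit]

def delete(batch):
    """Remove the given keys from the stand-in store."""
    for key in batch:
        pending.remove(key)

# Assign and test in one expression; an empty list ends the loop.
while (locations := fetch(5)):
    delete(locations)
    deleted_batches.append(locations)
```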

Adding a projection to an NHibernate criteria stops it from performing default entity selection

I'm writing an NHibernate criteria that selects data supporting paging. I'm using the COUNT(*) OVER() expression from SQL Server 2005(+) to get hold of the total number of available rows, as suggested by Ayende Rahien. I need that number to be able to calculate how many pages there are in total. The beauty of this solution is that I don't need to execute a second query to get hold of the row count.
However, I can't seem to manage to write a working criteria (Ayende only provides an HQL query).
Here's an SQL query that shows what I want and it works just fine. Note that I intentionally left out the actual paging logic to focus on the problem:
SELECT Items.*, COUNT(*) OVER() AS rowcount
FROM Items
Here's the HQL:
select
item, rowcount()
from
Item item
Note that the rowcount() function is registered in a custom NHibernate dialect and resolves to COUNT(*) OVER() in SQL.
A requirement is that the query is expressed using a criteria. Unfortunately, I don't know how to get it right:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(
Projections.SqlFunction("rowcount", NHibernateUtil.Int32));
Whenever I add a projection, NHibernate doesn't select item (like it would without a projection), just the rowcount(), while I really need both. Also, I can't seem to project item as a whole, only its properties, and I really don't want to list all of them.
I hope someone has a solution to this. Thanks anyway.
I think it is not possible in Criteria, it has some limits.
You could get the id and load items in a subsequent query:
var query = Session
.CreateCriteria<Item>("item")
.SetProjection(Projections.ProjectionList()
.Add(Projections.SqlFunction("rowcount", NHibernateUtil.Int32))
.Add(Projections.Id()));
If you don't like it, use HQL, you can set the maximal number of results there too:
IList<Item> result = Session
.CreateQuery("select item, rowcount() from item where ..." )
.SetMaxResults(100)
.List<Item>();
Use CreateMultiCriteria.
You can execute 2 simple statements with only one hit to the DB that way.
I am wondering why using Criteria is a requirement. Can't you use session.CreateSQLQuery? If you really must do it in one query, I would have suggested pulling back the Item objects and the count, like:
select {item.*}, count(*) over()
from Item {item}
...this way you can get back Item objects from your query, along with the count. If you experience a problem with Hibernate's caching, you can also configure the query spaces (entity/table caches) associated with a native query so that stale query cache entries will be cleared automatically.
If I understand your question properly, I have a solution. I struggled quite a bit with this same problem.
Let me quickly describe the problem I had, to make sure we're on the same page. My problem came down to paging. I want to display 10 records in the UI, but I also want to know the total number of records that matched the filter criteria. I wanted to accomplish this using the NH criteria API, but when adding a projection for row count, my query no longer worked, and I wouldn't get any results (I don't remember the specific error, but it sounds like what you're getting).
Here's my solution (copy & paste from my current production code). Note that "SessionError" is the name of the business entity I'm retrieving paged data for, according to 3 filter criteria: IsDev, IsRead, and IsResolved.
ICriteria crit = CurrentSession.CreateCriteria(typeof (SessionError))
    .Add(Restrictions.Eq("WebApp", this));
if (isDev.HasValue)
    crit.Add(Restrictions.Eq("IsDev", isDev.Value));
if (isRead.HasValue)
    crit.Add(Restrictions.Eq("IsRead", isRead.Value));
if (isResolved.HasValue)
    crit.Add(Restrictions.Eq("IsResolved", isResolved.Value));
// Order by most recent
crit.AddOrder(Order.Desc("DateCreated"));
// Copy the ICriteria query to get a row count as well
ICriteria critCount = CriteriaTransformer.Clone(crit)
.SetProjection(Projections.RowCountInt64());
critCount.Orders.Clear();
// NOW add the paging vars to the original query
crit = crit
    .SetMaxResults(pageSize)
    // SetFirstResult takes a zero-based row offset, so a one-based page 1 starts at 0
    .SetFirstResult((pageNum_oneBased - 1) * pageSize);
// Set up a multi criteria to get your data in a single trip to the database
IMultiCriteria multCrit = CurrentSession.CreateMultiCriteria()
.Add(crit)
.Add(critCount);
// Get the results
IList results = multCrit.List();
List<SessionError> sessionErrors = new List<SessionError>();
foreach (SessionError sessErr in ((IList)results[0]))
    sessionErrors.Add(sessErr);
numResults = (long)((IList)results[1])[0];
So I create my base criteria, with optional restrictions. Then I CLONE it, and add a row count projection to the CLONED criteria. Note that I clone it before I add the paging restrictions. Then I set up an IMultiCriteria to contain the original and cloned ICriteria objects, and use the IMultiCriteria to execute both of them. Now I have my paged data from the original ICriteria (and I only dragged the data I need across the wire), and also a raw count of how many actual records matched my criteria (useful for display or creating paging links, or whatever). This strategy has worked well for me. I hope this is helpful.
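The arithmetic that ties the row count to the paging links can be sketched on its own. This is only an illustration in Python, assuming page numbers start at 1 (so the first page has offset 0); the function name page_window is made up.

```python
def page_window(page_number, page_size, total_rows):
    """Return (first_result_offset, total_pages) for a one-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be at least 1")
    offset = (page_number - 1) * page_size
    total_pages = (total_rows + page_size - 1) // page_size  # ceiling division
    return offset, total_pages

first_page = page_window(1, 10, 95)   # 95 matching rows, 10 per page
last_page = page_window(10, 10, 95)
```

The offset is what would be handed to SetFirstResult, and total_pages is what the row count query makes possible.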
I would suggest investigating a custom result transformer by calling SetResultTransformer() on your criteria.
Create a formula property in the class mapping:
<property name="TotalRecords" formula="count(*) over()" type="Int32" not-null="true"/>
IList<...> result = criteria.SetFirstResult(skip).SetMaxResults(take).List<...>();
totalRecords = (result != null && result.Count > 0) ? result[0].TotalRecords : 0;
return result;
