Optimizing Entity Framework query - sql-server

I want to get data from a database using 3 tables: questions, answers and authentication_detail. The third table logs when a certain answer to a certain question is given. I need to compute the number of answer events for all answers of all questions in a given time frame, to display a statistics chart. I used the following query, but the performance is quite bad. I think I am doing something wrong, but I don't know how else to proceed.
var qData = questions.Select(qs => qs.Select(q => new SurveyData
{
    id = q.id,
    name = q.name,
    answers = q.answers.Where(a => a.is_archived == false).Select(a => new AnswerDTO
    {
        id = a.id,
        name = a.name,
        count = a.authentication_detail
            .Where(ad => ad.authentication.@event.insert_date.HasValue
                && ad.authentication.@event.insert_date.Value >= today
                && ad.authentication.@event.insert_date.Value.Hour == curHour).Count()
    }).OrderByDescending(el => el.count).ToList()
})).AsNoTracking().FirstOrDefault();
I am using Entity Framework 6.4.0.

OK, so my main problem was that I was using nested queries. Someone suggested I fetch all the required data from the database first and perform the operations on that data, and it worked. The query went down from 15s+ to around 1-2s.
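For anyone hitting the same issue, here is a minimal sketch of that "fetch first, shape in memory" approach. The entity and DTO names are taken from the query above, but the context.authentication_details set name and the answer_id foreign key are assumptions; adjust them to your actual model.
// 1) One grouped query for the per-answer event counts in the time frame.
//    "context.authentication_details" and "ad.answer_id" are assumed names.
var counts = context.authentication_details
    .Where(ad => ad.authentication.@event.insert_date.HasValue
        && ad.authentication.@event.insert_date.Value >= today
        && ad.authentication.@event.insert_date.Value.Hour == curHour)
    .GroupBy(ad => ad.answer_id)
    .Select(g => new { AnswerId = g.Key, Count = g.Count() })
    .ToDictionary(x => x.AnswerId, x => x.Count);

// 2) One query for the question and its non-archived answers.
var question = questions
    .AsNoTracking()
    .Select(q => new
    {
        q.id,
        q.name,
        answers = q.answers.Where(a => a.is_archived == false)
            .Select(a => new { a.id, a.name })
    })
    .FirstOrDefault();

// 3) Stitch the two results together in memory.
var qData = question == null ? null : new SurveyData
{
    id = question.id,
    name = question.name,
    answers = question.answers
        .Select(a => new AnswerDTO
        {
            id = a.id,
            name = a.name,
            count = counts.ContainsKey(a.id) ? counts[a.id] : 0
        })
        .OrderByDescending(el => el.count)
        .ToList()
};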

Related

Entity Framework: Max. number of "subqueries"?

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
select new {
PersonId = x.Id,
JobNames = x.Jobs.Select(y => y.Name),
TaskDates = x.Tasks.Select(y => y.Date),
DateInfos = x.Dates.Select(y => y.Info)
}).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the maximum number of these "subqueries" for a single statement? Is there a way to work around this issue without having to make more than one call to the database?
(ps: I'm not entirely sure, but I seem to remember that this query worked in LINQ2SQL - could it be?)
UPDATE
This is driving me crazy. I tried to repro the issue from the ground up using a fresh, simple project (so I could post the entire piece of code here, not only an oversimplified example), and I found I wasn't able to repro it. It still happens within our existing code base (apparently there's more behind this problem, but I cannot share this closed code base, unfortunately).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independent of the above) when I don't use a .Take() to only get the first X rows
It also works great when I add an additional .Where() statement to cut the number of rows returned from SQL Server
I didn't find any comprehensible reason why I see this behavior, but I started to look at the SQL: although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows on a specific index in the middle of the execution plan, which curiously ends in fewer rows returned for the entire SQL statement, which in turn results in the missing data that prompted my question in the first place.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (just don't use READ UNCOMMITTED), but I have no idea why it occurs and if it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least: As far as I can see, it's not an issue with EF itself.
A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name).ToList(),
                   TaskDates = x.Tasks.Select(y => y.Date).ToList(),
                   DateInfos = x.Dates.Select(y => y.Info).ToList()
               }).ToList();

Django Model: Best Approach to update?

I am trying to update hundreds of objects in my job, which is scheduled every two hours.
I have an articles table in my model. All articles are parsed and then different attributes are saved for each article.
First I query to get all unparsed articles, then parse the URL saved against each article and save the received attributes.
Below is my code
articles = Articles.objects.filter(status=0)  # 100's of articles
for art in articles:
    try:
        url = art.link
        result = ArticleParser(url)  # Custom function which will do all the parsing
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
        art.save()
    except Exception as e:
        art.author = ""
        art.description = ""
        art.imageurl = ""
        art.status = 2
        art.save()
The thing is, when this job is running, both CPU utilization and DB process utilization are very high. I am trying to pinpoint when and where it spikes.
Question: Is this the right way to update multiple objects, or is there a better way to do it? Any suggestions?
Appreciate your help.
Regards
Edit 1: Sorry for the confusion; some explanation is needed. Fields like author, description etc. will be different for every article; they are returned after I parse the URL. The reason I am updating in a loop is that these fields differ on every iteration according to the URL. I have updated the code; I hope it clears up the confusion.
You are doing 100s of DB operations in a relatively tight loop, so it is expected that there is some load on the DB.
If you have a lot of articles, make sure you have an index on the status column to avoid a table scan.
You can try disabling autocommit and wrapping the whole update in one transaction instead.
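For example, a minimal sketch of the single-transaction idea, assuming Django 1.6+ (older versions would use transaction.commit_on_success instead of atomic), reusing the Articles model and ArticleParser function from the question:
from django.db import transaction

# One commit at the end instead of one commit per save().
with transaction.atomic():
    for art in Articles.objects.filter(status=0):
        try:
            result = ArticleParser(art.link)
            art.author = result.articleauthor
            art.description = result.articlecontent[:5000]
            art.imageurl = result.articleImage
            art.status = 1
        except Exception:
            art.author = art.description = art.imageurl = ""
            art.status = 2
        art.save()
Note that this trades per-row commit overhead for one long-lived transaction that stays open while the URLs are being parsed.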
From my understanding, you do NOT want to set the fields author, description and imageurl to the same value on all articles, so QuerySet.update won't work for you.
Django recommends this approach when you want to update or delete multiple objects: https://docs.djangoproject.com/en/1.6/topics/db/optimization/#use-queryset-update-and-delete
1. Better not to catch the generic Exception; catch specific exceptions such as KeyError, IndexError etc.
2. The data can be created once. Something like this:
data = dict(
    author=articleauthor,
    description=articlecontent[:5000],
    imageurl=articleImage,
    status=1,
)
Articles.objects.filter(status=0).update(**data)
Regarding Edit 1: you probably want to set up periodic Celery tasks, that is, a separate task for each article. See the Celery documentation for help.
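A rough sketch of that idea, assuming Celery 3.1+ with shared_task; Articles and ArticleParser are the model and custom parser from the question, and the task names are illustrative:
from celery import shared_task

from models import Articles


@shared_task
def parse_article(article_id):
    # Each article becomes its own small task instead of one big loop.
    art = Articles.objects.get(pk=article_id)
    try:
        result = ArticleParser(art.link)
        art.author = result.articleauthor
        art.description = result.articlecontent[:5000]
        art.imageurl = result.articleImage
        art.status = 1
    except Exception:
        art.author = art.description = art.imageurl = ""
        art.status = 2
    art.save()


@shared_task
def enqueue_unparsed_articles():
    # Register this task in the beat schedule to run every two hours;
    # it only enqueues work, and the workers spread the load.
    for pk in Articles.objects.filter(status=0).values_list("pk", flat=True):
        parse_article.delay(pk)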

How to increase query speed in db4o?

OutOfMemoryError caused when db4o database has 15000+ objects
My question is in reference to my previous question (above). For the same PostedMessage model and same query.
With 100,000 PostedMessage objects, the query takes about 1243 ms to return the first 20 PostedMessages.
Now I have saved 1,000,000 PostedMessage objects in db4o. The same query took 342,132 ms, which is a non-linear increase.
How can I optimize the query speed?
FYR:
The timeSent and timeReceived are Indexed fields.
I am using SNAPSHOT query mode.
I am not using TA/TP.
Do you sort the result? Unfortunately db4o doesn't use the index for sorting / orderBy. That means it will run a regular sort algorithm, which is O(n*log(n)). It won't scale linearly.
Also, db4o doesn't support a TOP operator. That means even without sorting it takes quite a bit of time to copy the ids to the result set, even when you never read the entities afterwards.
So, there's no really good solution for this, except trying to use criteria which cut down the result size.
Some adventurous people might use a different query evaluation mode, but personally I don't recommend that.
@Gamlor No, I am not sorting at all. The code is as follows:
public static ObjectSet<PostedMessage> getMessagesBetweenDates(
        Calendar after,
        Calendar before,
        ObjectContainer db) {
    if (after == null || before == null || db == null) {
        return null;
    }
    Query q = db.query(); // db is pre-configured to use SNAPSHOT mode.
    q.constrain(PostedMessage.class);
    Constraint from = q.descend("timeRecieved").constrain(new Long(after.getTimeInMillis())).greater().equal();
    q.descend("timeRecieved").constrain(new Long(before.getTimeInMillis())).smaller().equal().and(from);
    ObjectSet<PostedMessage> results = q.execute();
    return results;
}
The arguments to this method are as follows:
after = 13-09-2011 10:55:55
before = 13-09-2011 10:56:10
And I expect only 10 PostedMessages to be returned between "after" and "before". (I am generating dummy PostedMessages with timeReceived incremented by 1 second each.)

Is there any way to do a Insert or Update / Merge / Upsert in LLBLGen

I'd like to do an upmerge using LLBLGen without first fetching then saving the entity.
I already found the possibility to update without fetching the entity first, but then I have to know it is already there.
Updating entries would be about as often as inserting a new entry.
Is there a possibility to do this in one step?
Would it make sense to do it in one step?
Facts:
LLBLgen Pro 2.6
SQL Server 2008 R2
.NET 3.5 SP1
I know I'm a little late for this, but as I remember from working with LLBLGen Pro, it is totally possible, and one of its beauties is that everything is possible!
I don't have my samples, but I'm pretty sure there is a method named UpdateEntitiesDirectly that can be used like this:
// suppose we have Product and Order entities
using (var daa = new DataAccessAdapter())
{
    // entity carrying the new field values to apply to all matching rows
    var newValues = new OrderEntity { State = 1 };
    var filter = new RelationPredicateBucket(
        OrderFields.ProductId == 23 & OrderFields.Date > DateTime.Now.AddDays(-2));
    int numberOfUpdatedEntities = daa.UpdateEntitiesDirectly(newValues, filter);
}
When using LLBLGen Pro we were able to do pretty much everything that is possible with an ORM framework, it's just great!
It also has a method to do a batch delete called DeleteEntitiesDirectly that may be useful in scenarios where you need to delete an entity and replace it with another one.
Hope this is helpful.
I think you can achieve what you're looking for by using an EntityCollection. First fetch the entities you want to update with the FetchEntityCollection method of DataAccessAdapter, then change anything you want in that collection, insert new entities into it, and save it using the DataAccessAdapter.SaveEntityCollection method. This way existing entities are updated and new ones are inserted into the database. For example, in a product/order scenario in which you want to manipulate the orders of a specified product, you can use something like this:
int productId = 23;
var orders = new EntityCollection<OrderEntity>();
using (DataAccessAdapter daa = new DataAccessAdapter())
{
    daa.FetchEntityCollection(orders, new RelationPredicateBucket(OrderFields.ProductId == productId));
    foreach (var order in orders)
    {
        order.State = 1;
    }
    OrderEntity newOrder = new OrderEntity();
    newOrder.ProductId = productId;
    newOrder.State = 0;
    orders.Add(newOrder);
    daa.SaveEntityCollection(orders);
}
As far as I know, this is not possible, and could not be possible.
If you were to just call adapter.Save(entity) on an entity that was not fetched, the framework would assume it was new. If you think about it, how could the framework know whether to emit an UPDATE or an INSERT statement? No matter what, something somewhere would have to query the database to see if the row exists.
It would not be too difficult to create something that did this more or less automatically for single-entity (non-recursive) saves. The steps would be something like the following (a rough sketch follows the list):
Create a new entity and set its fields.
Attempt to fetch an entity of the same type using the PK or a unique constraint (there are other options as well, but none as uniform)
If the fetch fails, just save the new entity (INSERT)
If the fetch succeeds, map the fields of the created entity to the fields of the fetched entity.
Save the fetched entity (UPDATE).
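A rough sketch of those steps using the adapter API (FetchEntity/SaveEntity) and the OrderEntity from the earlier example; it assumes OrderId is the primary key, and the two copied fields are illustrative only — a production version would copy all changed fields generically:
public static void UpsertOrder(OrderEntity newValues)
{
    using (var daa = new DataAccessAdapter())
    {
        // Try to fetch an existing entity with the same primary key.
        var existing = new OrderEntity(newValues.OrderId);
        daa.FetchEntity(existing);

        if (existing.Fields.State != EntityState.Fetched)
        {
            // Nothing there yet: plain INSERT.
            daa.SaveEntity(newValues);
            return;
        }

        // Row exists: copy the new values onto the fetched entity and UPDATE.
        existing.ProductId = newValues.ProductId;
        existing.State = newValues.State;
        daa.SaveEntity(existing);
    }
}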

Why is my django bulk database population so slow and frequently failing?

I decided I'd like to use django's model system rather than coding raw SQL to interface with my database, but I am having a problem that surely is avoidable.
My models.py contains:
class Student(models.Model):
    student_id = models.IntegerField(unique=True)
    form = models.CharField(max_length=10)
    preferred = models.CharField(max_length=70)
    surname = models.CharField(max_length=70)
and I'm populating it by looping through a list as follows:
from models import Student

for id, frm, pref, sname in large_list_of_data:
    s = Student(student_id=id, form=frm, preferred=pref, surname=sname)
    s.save()
I don't really want to be saving this to the database each time but I don't know another way to get django to not forget about it (I'd rather add all the rows and then do a single commit).
There are two problems with the code as it stands.
It's slow -- about 20 students get updated each second.
It doesn't even make it through large_list_of_data, instead throwing a DatabaseError saying "unable to open database file". (Possibly because I'm using sqlite3.)
My question is: How can I stop these two things from happening? I'm guessing that the root of both problems is that I've got the s.save() but I don't see a way of easily batching the students up and then saving them in one commit to the database.
So it seems I should have looked harder before posing the question.
Some solutions are described in this stackoverflow question (the winning answer is to use django.db.transaction.commit_manually) and also in this one on aggregating saves.
Other ideas for speeding up this type of operation are listed in this stackoverflow question.
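For completeness, a minimal sketch of the aggregated approach using bulk_create (available since Django 1.4); the commit_manually approach from the linked answers would instead wrap the original loop in a single transaction:
from models import Student

# Build all Student instances in memory, then let Django insert them in a
# handful of INSERT statements instead of one query + commit per row.
students = [
    Student(student_id=id, form=frm, preferred=pref, surname=sname)
    for id, frm, pref, sname in large_list_of_data
]
Student.objects.bulk_create(students)
On SQLite you may also need the batch_size argument (Django 1.5+) to stay under its parameter limit.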

Resources