I have a problem with EF Core. Every time I write a LINQ query in C# to get data from the database, it adds a useless select * from statement, and I can't figure out why.
The raw SQL query runs noticeably faster: about 100 ms versus 300 ms using LINQ.
This is the method in C#:
return (from pr in _db.ex_DocumentExt1_PR
        from doc in _db.ex_Document.Where(doc => doc.DOCID == pr.DOCID).DefaultIfEmpty()
        from docAc in _db.ex_DOCAction.Where(docAc => docAc.DOCID == pr.DOCID).DefaultIfEmpty()
        from st in _db.ex_Status.Where(st => st.STATUS_ID == doc.DOC_STATUS).DefaultIfEmpty()
        from dep in _db.SSO_Entities.Where(dep => dep.Type == SSO_EntityTypes.COMPANY_STRUCTURE && dep.EntityCode == pr.RequestedForDepartamentId.ToString()).DefaultIfEmpty()
        where docAc.ISPERFORMED == 1
              && docAc.ACTOR_ID == uid
              && doc.DOC_NUMBER != "YENI"
              && doc.DOC_NUMBER.Contains(searchText)
        group new { doc, st, dep, docAc }
            by new { doc.DOCID, doc.DOC_NUMBER, st.SHORT_NAME, dep.DisplayName, docAc.ACTION_PERFORMED } into g1
        orderby g1.Key.ACTION_PERFORMED descending
        select new LastActiveDocumentViewModel
        {
            DocId = g1.Key.DOCID,
            DocNumber = g1.Key.DOC_NUMBER,
            DocStatus = g1.Key.SHORT_NAME,
            DocType = DocumentType.PR.ToString(),
            Supplier = g1.Key.DisplayName,
            Date = g1.Max(g => g.docAc.ACTION_PERFORMED)
        });
This is the SQL query generated by EF Core:
SELECT TOP (50)
[Project1].[C2] AS [C1],
[Project1].[DOCID] AS [DOCID],
[Project1].[DOC_NUMBER] AS [DOC_NUMBER],
[Project1].[SHORT_NAME] AS [SHORT_NAME],
[Project1].[C3] AS [C2],
[Project1].[DisplayName] AS [DisplayName],
[Project1].[C1] AS [C3]
FROM ( SELECT
[GroupBy1].[A1] AS [C1],
[GroupBy1].[K1] AS [DOCID],
[GroupBy1].[K2] AS [DOC_NUMBER],
[GroupBy1].[K3] AS [ACTION_PERFORMED],
[GroupBy1].[K4] AS [SHORT_NAME],
[GroupBy1].[K5] AS [DisplayName],
1 AS [C2],
N'PR' AS [C3]
FROM ( SELECT
[Filter1].[DOCID1] AS [K1],
[Filter1].[DOC_NUMBER] AS [K2],
[Filter1].[ACTION_PERFORMED] AS [K3],
[Filter1].[SHORT_NAME] AS [K4],
[Extent5].[DisplayName] AS [K5],
MAX([Filter1].[ACTION_PERFORMED]) AS [A1]
FROM (SELECT [Extent1].[RequestedForDepartamentId] AS [RequestedForDepartamentId], [Extent2].[DOCID] AS [DOCID1], [Extent2].[DOC_NUMBER] AS [DOC_NUMBER], [Extent3].[ACTOR_ID] AS [ACTOR_ID], [Extent3].[ACTION_PERFORMED] AS [ACTION_PERFORMED], [Extent4].[SHORT_NAME] AS [SHORT_NAME]
FROM [dbo].[ex_DocumentExt1_PR] AS [Extent1]
LEFT OUTER JOIN [dbo].[ex_Document] AS [Extent2] ON [Extent2].[DOCID] = [Extent1].[DOCID]
INNER JOIN [dbo].[ex_DOCAction] AS [Extent3] ON [Extent3].[DOCID] = CAST( [Extent1].[DOCID] AS bigint)
LEFT OUTER JOIN [dbo].[ex_Status] AS [Extent4] ON [Extent4].[STATUS_ID] = [Extent2].[DOC_STATUS]
WHERE ( NOT (('YENI' = [Extent2].[DOC_NUMBER]) AND ([Extent2].[DOC_NUMBER] IS NOT NULL))) AND (1 = [Extent3].[ISPERFORMED]) ) AS [Filter1]
LEFT OUTER JOIN [dbo].[SSO_Entities] AS [Extent5] ON ('COMPANY_STRUCTURE' = [Extent5].[Type]) AND (([Extent5].[EntityCode] = (CASE WHEN ([Filter1].[RequestedForDepartamentId] IS NULL) THEN N'' ELSE CAST( [Filter1].[RequestedForDepartamentId] AS nvarchar(max)) END)) OR (([Extent5].[EntityCode] IS NULL) AND (CASE WHEN ([Filter1].[RequestedForDepartamentId] IS NULL) THEN N'' ELSE CAST( [Filter1].[RequestedForDepartamentId] AS nvarchar(max)) END IS NULL)))
WHERE ([Filter1].[ACTOR_ID] = 1018) AND ([Filter1].[DOC_NUMBER] LIKE '%%' ESCAPE '~')
GROUP BY [Filter1].[DOCID1], [Filter1].[DOC_NUMBER], [Filter1].[ACTION_PERFORMED], [Filter1].[SHORT_NAME], [Extent5].[DisplayName]
) AS [GroupBy1]
) AS [Project1]
ORDER BY [Project1].[ACTION_PERFORMED] DESC
This is the raw SQL query I wrote that does the same thing as the LINQ query:
SELECT TOP(50)
doc.DOCID,
doc.DOC_NUMBER,
'PR',
st.SHORT_NAME,
dep.DisplayName,
MAX(docAc.ACTION_PERFORMED)
FROM ex_DocumentExt1_PR pr
LEFT JOIN ex_Document doc ON doc.DOCID = pr.DOCID
LEFT JOIN ex_DOCAction docAc ON docAc.DOCID = doc.DOCID
LEFT JOIN ex_Status st ON st.STATUS_ID = doc.DOC_STATUS
LEFT JOIN SSO_Entities dep ON dep.Type = 'COMPANY_STRUCTURE' AND dep.EntityCode = pr.RequestedForDepartamentId
WHERE docAc.ISPERFORMED = 1
AND docAc.ACTOR_ID = 1018
AND doc.DOC_NUMBER != 'Yeni'
GROUP BY doc.DOCID, doc.DOC_NUMBER, st.SHORT_NAME, dep.DisplayName
ORDER BY MAX(docAc.ACTION_PERFORMED) DESC
EF is not intended to be a wrapper for SQL. I don't see any "SELECT *" in the generated SQL, though what you will encounter is a range of inner SELECT statements that EF builds to allow it to join tables that normally don't have established references to one another. This is a necessary evil for EF to be able to query across data based on how you want to relate them.
EF's strength is simplifying data access when working with properly normalized data structures where those relationships can be resolved either through convention or configuration. I don't agree that EF doesn't handle multiple tables "well"; it can handle them quite quickly provided they are properly related and indexed. The reality, though, is that many data systems in the wild do not follow proper normalization, and you end up needing to query across loosely related data. EF can do it, but it won't be the most efficient at it.
If this is a new project / database, whether leveraging Code First or Schema First, my recommendation would be to establish properly normalized relationships with FKs and indexes/constraints between the tables.
If this is an existing database where you don't have the option to modify the schema, then I would recommend employing a View to bind a desired entity model, from which you can employ a more directly optimized SQL expression to get the data you want. This would be a distinct set of entities, as opposed to the per-table entities that you would use to update data. The goal is that large, open-ended read operations with loose relationships, which lead to expensive queries, can be optimized down, while update operations, which should be "touching" far fewer records at a time, can be managed via the table-based entities.
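A minimal sketch of that view-backed read model in EF Core (the entity and view names here are hypothetical, and HasNoKey/ToView assume EF Core 3.0 or later):

```csharp
// Hypothetical read-only entity shaped like the query result.
public class LastActiveDocumentRow
{
    public long DocId { get; set; }
    public string DocNumber { get; set; }
    public string DocStatus { get; set; }
    public string Supplier { get; set; }
    public DateTime? LastAction { get; set; }
}

// In the DbContext: map it to a view instead of a table.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<LastActiveDocumentRow>(e =>
    {
        e.HasNoKey();                        // read-only projection, no updates
        e.ToView("vw_LastActiveDocuments");  // hypothetical view holding the tuned SQL
    });
}

// Reads then become a simple, cheap-to-translate query:
// var rows = _db.Set<LastActiveDocumentRow>()
//               .Where(r => r.DocNumber.Contains(searchText))
//               .ToList();
```

The joins, grouping, and MAX() live inside the view's hand-written SQL, so EF only ever translates a flat SELECT against it.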
The queries don't look identical. For example, your query groups by 4 columns, whilst the EF query groups by 5 - it has [Filter1].[ACTION_PERFORMED] in its GROUP BY clause in addition to the other four. Depending on your test data sample they might behave similarly, but in general the results will differ.
As @allmhuran has noted in the comments, EF has a tendency to generate inefficient queries, especially when more than 2 tables are involved. Personally, when I find myself in such a situation, I create a database view, put the query there, add the view to the DbContext and select directly from it. In extreme scenarios, that might even be a stored procedure. But that's me: I know SQL much better than C#, always use the Database First approach and keep my database in an accompanying SSDT project.
If you use EF Code First and employ EF Migrations, adding a view might be a bit of a problem, but it should be possible. This question might be a good start.
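For the Code First / Migrations route, one common pattern (a sketch only; the migration class, view name, and view body are placeholders, not the answerer's actual code) is an empty migration whose Up/Down methods run raw DDL:

```csharp
public partial class AddLastActiveDocumentsView : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // CREATE OR ALTER needs SQL Server 2016 SP1+; use DROP/CREATE on older versions.
        migrationBuilder.Sql(@"
CREATE OR ALTER VIEW dbo.vw_LastActiveDocuments AS
SELECT /* hand-tuned query goes here */ 1 AS Placeholder;");
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.Sql("DROP VIEW dbo.vw_LastActiveDocuments;");
    }
}
```

Because the view is created by raw SQL, the model snapshot stays unaware of it; only the keyless entity mapping in the DbContext ties the two together.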
Related
My issue is with the queries that EF Core generates for fetching ordered items from a child collection of a parent.
I have a parent class which has a collection of child objects. I'm using Entity Framework Core 5.0.5 (code first) against a SQL Server database. I've tried to boil down the scenario, so let's call it an Owner with a collection of Pets.
I often want a list of owners with their oldest pet, so I'll do something like
Context.Owners
    .Select(owner => new
    {
        Owner = owner,
        OldPet = owner.Pets.OrderBy(pet => pet.Age).LastOrDefault()
    })
    .Where(x => x.Owner.Id == 1);
This worked fine before (on EF6) and works functionally now. However, the issue I have is that EF Core now translates these sub-collection queries into something apparently cleverer, something like
SELECT *
FROM [Owners] AS [c]
LEFT JOIN (
SELECT *
FROM (
SELECT [c0].[Id] ... , ROW_NUMBER() OVER(PARTITION BY [c0].[OwnerId] ORDER BY [c0].[Age] DESC) AS [row]
FROM [Pets] AS [c0]
) AS [t]
WHERE [t].[row] <= 1
) AS [t0] ON [c].[Id] = [t0].[OwnerId]
The problem I'm having is that it seems to perform terribly. Looking at the execution plan it's doing a clustered index seek on the pets table, then sorting them. The 'number of rows read' is massive and the 'sorting' takes tens or hundreds of milliseconds.
The way EF6 does the same functionality seemed way more performant in this sort of scenario.
Is there a way to change the behaviour so I can choose? Or a way to rewrite this type of query such that I don't have this problem? I've tried many variations of using GroupBy etc and still have the same result.
If you are doing FirstOrDefault in a projection, EF Core has to create such a join, which uses the window function ROW_NUMBER. To get the desired SQL, it is better to rewrite your query to be more predictable for the LINQ translator:
var query =
    from owner in Context.Owners
    from pet in owner.Pets
    where owner.Id == 1
    orderby pet.Age descending
    select new
    {
        Owner = owner,
        OldPet = pet
    };

var result = query.FirstOrDefault();
Update: I will get query plan as soon as I can.
We had a poorly performing query that took 4 minutes for a particular organization. After the usual recompiling of the stored proc and updating statistics didn't help, we rewrote the IF EXISTS(...) as a SELECT COUNT(*)..., and the stored procedure went from 4 minutes to 70 milliseconds. What is it about the conditional that makes a 70 ms query take 4 minutes? See the examples.
These all take 4+ minutes:
if (
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
print 'records';
-
IF (EXISTS(
SELECT *
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)))
print 'records';
This takes 70 milliseconds:
Declare @recordCount INT;
SELECT @recordCount = COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL);
IF(@recordCount > 0)
print 'records';
It doesn't make sense to me why moving the exact same COUNT(*) query into an IF statement causes such degradation, or why EXISTS is slower than COUNT. I even tried the EXISTS() in a SELECT CASE WHEN EXISTS(), and it still takes 4+ minutes.
Given that my previous answer was mentioned, I'll try to explain again, because these things are pretty tricky. So yes, I think you're seeing the same problem as the other question: namely, a row-goal issue.
To try to explain what's causing this, I'll start with the three types of join that are at the engine's disposal (pretty much categorically): loop joins, merge joins, and hash joins. Loop joins are what they sound like: a nested loop over both sets of data. Merge joins take two sorted lists and move through them in lock-step. And hash joins throw everything in the smaller set into a filing cabinet and then look for items in the larger set once the filing cabinet has been filled.
Performance-wise, loop joins require pretty much no set-up, and if you're only looking for a small amount of data they're optimal. Merge joins are the best of the best as far as join performance for any data size, but they require the data to be already sorted (which is rare). Hash joins require a fair amount of set-up but allow large data sets to be joined quickly.
Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. The behavior you're seeing is that the optimizer thinks that rows for this query are really likely (you can confirm this by planning the query without the grouping and seeing how many records it thinks it will get in the last step). In particular, it probably thinks that for some table in that query, every record in that table will appear in the output.
"Eureka!" it says, "if every row in this table ends up in the output, to find if one exists I can do the really cheap start-up loop join throughout because even though it's slow for large data sets, I only need one row." But then it doesn't find that row. And doesn't find it again. And now it's iterating through a vast set of data using the least efficient means at its disposal for weeding through large sets of data.
By comparison, if you ask for the full count of data, it has to find every record by definition. It sees a vast set of data and picks the choices that are best for iterating through that entire set of data instead of just a tiny sliver of it.
If, on the other hand, it really was correct and the records were very well correlated it would have found your record with the smallest possible amount of server resources and maximized its overall throughput.
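One way to test the row-goal theory directly (a sketch; the USE HINT syntax requires SQL Server 2016 SP1 or later) is to keep the cheap "stop at the first row" semantics but tell the optimizer to cost the plan as if every row were needed:

```sql
DECLARE @found bit = 0;

SELECT TOP (1) @found = 1
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE om.StatusCode IN ('F', 'C')
  AND o.OrganismGroupID <> -1
  AND od.OrganismDrugGroupID <> -1
  AND (om.LabType <> 'screen' OR om.LabType IS NULL)
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));  -- ignore the row goal TOP (1) would set

IF (@found = 1)
    print 'records';
```

If this form runs in the 70 ms range, it's strong evidence the 4-minute plan came from the optimizer chasing a row goal it could never satisfy.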
Hi, I have a stored procedure that is used to fetch records while searching. This procedure returns millions of records. However, a bug was found inside the search procedure: it also returns duplicate records in some scenarios, when certain conditions are met. I have found why it was returning duplicate records. Below is the query in question:
With cteAutoApprove (AcctID, AutoApproved,DecisionDate)
AS (
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
MIN(awt.dtEnter) AS DecisionDate
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
GROUP BY
awt.AcctID
) A
Join AccountWorkflowTask awt1
on awt1.dtEnter=A.DecisionDate and awt1.AcctID=a.AcctID
),
This CTE was returning duplicate records because of the join condition awt1.dtEnter=A.DecisionDate: for some accounts the dtEnter was exactly the same, which is why duplicates came back.
My question is what I should use to prevent this. I cannot use DISTINCT here, as it will definitely slow down the search procedure. Should I use RANK or DENSE_RANK so that it is optimized and the query takes less time to execute? Or some other technique? Please help, as I am actually stuck here.
It does seem like a good candidate for row_number (not rank: with the same dates on the same AcctID, you'd still have multiple records).
Obviously I can't test the query here, but winging it:
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
awt.dtEnter AS DecisionDate,
autoEnter,
row_number() over (partition by awt.acctid order by awt.dtEnter) rnr
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
) A
where rnr = 1
This way, the group by is no longer necessary: getting the first date is done by row_number. Neither is the second join: the subquery already contains all the data (and the optimizer is smart enough not to do anything with the rows it doesn't need).
PS: because SQL Server window functions work incredibly efficiently, using row_number instead of the min()-plus-join construction will most likely give a performance boost, even if there were no duplicate rows.
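A toy illustration of the rank vs. row_number point (made-up data, not the poster's tables): with tied dtEnter values, RANK keeps both tied rows at rank 1, while ROW_NUMBER breaks the tie arbitrarily, so filtering on rnr = 1 returns exactly one row per account:

```sql
SELECT AcctID, dtEnter,
       ROW_NUMBER() OVER (PARTITION BY AcctID ORDER BY dtEnter) AS rnr,
       RANK()       OVER (PARTITION BY AcctID ORDER BY dtEnter) AS rnk
FROM (VALUES (1, '2020-01-01'),
             (1, '2020-01-01'),   -- same account, same instant: the duplicate case
             (1, '2020-01-02')) AS t (AcctID, dtEnter);
-- rnr = 1, 2, 3  (exactly one row survives WHERE rnr = 1)
-- rnk = 1, 1, 3  (both tied rows survive WHERE rnk = 1)
```

This is why the answer above uses row_number rather than rank for de-duplication.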
I have a report with 3 sub-reports and several queries to optimize in each. The first has several OR clauses in the WHERE branch, and the ORs filter through IN conditions that pull from sub-queries.
I say this mostly from reading this SO post. Specifically LBushkin's second point.
I'm not the greatest at TSQL but I know enough to think this is very inefficient. I think I need to do two things.
I know I need to add indexes to the tables involved.
I think the query can be greatly enhanced.
So it seems that my first step would be to improve the query. From there I can look at what columns and tables are involved and thus determine the indexes.
At this point I haven't posted table schemas, as I'm looking more for options / considerations, such as using a CTE to replace all the IN sub-queries.
If needed I will definitely post whatever would be helpful such as physical reads etc.
SELECT DISTINCT
auth.icm_authorizationid,
auth.icm_documentnumber
FROM
Filteredicm_servicecost AS servicecost
INNER JOIN Filteredicm_authorization AS auth ON
auth.icm_authorizationid = servicecost.icm_authorizationid
INNER JOIN Filteredicm_service AS service ON
service.icm_serviceid = servicecost.icm_serviceid
INNER JOIN Filteredicm_case AS cases ON
service.icm_caseid = cases.icm_caseid
WHERE
(cases.icm_caseid IN
(SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case))
OR (service.icm_serviceid IN
(SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
OR (servicecost.icm_servicecostid IN
(SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
OR (auth.icm_authorizationid IN
(SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))
EXISTS is usually much faster than IN, as the query engine is able to optimize it better.
Try this:
WHERE EXISTS (SELECT 1 FROM Filteredicm_case WHERE icm_caseid = cases.icm_caseid)
OR EXISTS (SELECT 1 FROM Filteredicm_service WHERE icm_serviceid = service.icm_serviceid)
OR EXISTS (SELECT 1 FROM Filteredicm_servicecost WHERE icm_servicecostid = servicecost.icm_servicecostid)
OR EXISTS (SELECT 1 FROM Filteredicm_authorization WHERE icm_authorizationid = auth.icm_authorizationid)
Furthermore, an index on Filteredicm_case.icm_caseid, an index on Filteredicm_service.icm_serviceid, an index on Filteredicm_servicecost.icm_servicecostid, and an index on Filteredicm_authorization.icm_authorizationid will increase performance of this query. They look like they should be keys already, however, so I suspect that indices already exist.
However, unless I'm misreading, there's no way this WHERE clause will ever evaluate to anything other than true.
The clause you wrote says WHERE cases.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). However, cases is an alias to Filteredicm_case. That's the equivalent of WHERE Filteredicm_case.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). That will be true as long as Filteredicm_case.icm_caseid isn't NULL.
The same error in logic exists for the remaining portions in the WHERE clause:
(service.icm_serviceid IN (SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
service is an alias for Filteredicm_service. This is always true as long as icm_serviceid is not null.
(servicecost.icm_servicecostid IN (SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
servicecost is an alias for Filteredicm_servicecost. This is always true as long as icm_servicecostid is not null.
(auth.icm_authorizationid IN (SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))
auth is an alias for Filteredicm_authorization. This is always true as long as icm_authorizationid is not null.
I don't understand what you're trying to accomplish.
I want to get a list of people affiliated with a blog. The table [BlogAffiliates] has:
BlogID
UserID
Privelage
and if a person associated with that blog has a lower or equal Privelage, they cannot edit (bit field canedit).
Is this query the most efficient way of doing this, or are there better ways to derive this information?
I wonder if it can be done in a single query.
Can it be done without that CONVERT in some cleverer way?
declare @privelage tinyint
select @privelage = (select Privelage from BlogAffiliates
                     where UserID = @UserID and BlogID = @BlogID)

select aspnet_Users.UserName as username,
       BlogAffiliates.Privelage as privelage,
       Convert(Bit, Case When @privelage > BlogAffiliates.Privelage
                         Then 1 Else 0 End) As canedit
from BlogAffiliates, aspnet_Users
where BlogAffiliates.BlogID = @BlogID and BlogAffiliates.Privelage >= 2
  and aspnet_Users.UserId = BlogAffiliates.UserID
Some of this would depend on the indexes and the size of the tables involved. If, for example, the most costly portion of the query when you profiled it was a seek on the BlogAffiliates.BlogID column, then you could do one select into a table variable and then do both calculations from there.
However, I think the query you have stated is probably close to the most efficient. The only possible duplicated work is that you are seeking twice on the BlogAffiliates.BlogID field because of the two queries.
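If you do want it in a single statement (a sketch only, keeping the original column spellings and untested against the real schema), the caller's privilege can be fetched inline with a correlated subquery instead of a separate SELECT into a variable:

```sql
SELECT u.UserName AS username,
       ba.Privelage AS privelage,
       CONVERT(bit,
               CASE WHEN (SELECT mine.Privelage
                          FROM BlogAffiliates mine
                          WHERE mine.UserID = @UserID
                            AND mine.BlogID = @BlogID) > ba.Privelage
                    THEN 1 ELSE 0 END) AS canedit
FROM BlogAffiliates ba
INNER JOIN aspnet_Users u ON u.UserId = ba.UserID
WHERE ba.BlogID = @BlogID
  AND ba.Privelage >= 2;
```

The subquery is uncorrelated with the outer rows (it depends only on @UserID/@BlogID), so the engine can evaluate it once; whether that beats the two-statement version is something the execution plan would have to confirm.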
You can try the query below.
Select aspnet_Users.UserName as username, Blog.Privelage as privelage,
       Convert(Bit, Case When @privelage > Blog.Privelage
                         Then 1 Else 0 End) As canedit
From
(
    Select UserID, Privelage
    From BlogAffiliates
    Where BlogID = @BlogID and Privelage >= 2
) Blog
Inner Join aspnet_Users on aspnet_Users.UserId = Blog.UserID
As per my understanding, you should not use a table variable if you are joining it with other tables, as this can reduce performance. But if the number of records is small, you can go for it. You can also use local temporary tables for this purpose.