Scaling issue with a view in sql server 2005 - sql-server

I have an issue with scaling when using a view in MS SQL Server 2005. My defined view returns the data of interest on a small test dataset; however, when I try to return the view on a large dataset (>10000 rows), my query just runs and runs without returning. Is this an error in the efficiency of the view definition, or maybe do I need to include indexing? Are there other ways that I could make the view return data more quickly/efficiently so that it can be practically used?
My current view definition is as follows:
CREATE VIEW reportedProjects AS
SELECT vc.projectID, vc.sampleID, vc.runID, vc.ngsSubpanel,
va.[HGVS_c], va.[HGVS_p], va.geneID, va.transcriptID,
sc.score, sc.scorer, sc.checker
FROM dbo.varCall vc, dbo.varAnnotation va, dbo.varScore sc
WHERE vc.varCallID = va.varCallID
AND vc.variantID = va.variantID
AND va.variantID = sc.variantID
AND va.transcriptID = sc.transcriptID;
GO
-- Update --
Apologies for any red herrings this may have thrown up for readers. A change of base table columns (not listed in the above problem description) to join on within the view has solved my issue. All suggestions offered were much appreciated though.

Add indexes on varCallID, variantID, and transcriptID, then rewrite the view with explicit JOINs and see how it performs.
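A minimal sketch of that suggestion, using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only so the example runs anywhere). The table and column names come from the view above; the column types, index names, and sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical column types; only the names come from the question.
    CREATE TABLE varCall (varCallID INTEGER, variantID INTEGER, projectID INTEGER,
                          sampleID INTEGER, runID INTEGER, ngsSubpanel TEXT);
    CREATE TABLE varAnnotation (varCallID INTEGER, variantID INTEGER, HGVS_c TEXT,
                                HGVS_p TEXT, geneID TEXT, transcriptID TEXT);
    CREATE TABLE varScore (variantID INTEGER, transcriptID TEXT, score INTEGER,
                           scorer TEXT, checker TEXT);

    -- Indexes on the join columns (index names are made up).
    CREATE INDEX ix_varCall_join ON varCall (varCallID, variantID);
    CREATE INDEX ix_varAnnotation_join ON varAnnotation (varCallID, variantID, transcriptID);
    CREATE INDEX ix_varScore_join ON varScore (variantID, transcriptID);

    -- One invented row per table so the query returns something.
    INSERT INTO varCall VALUES (1, 10, 100, 200, 300, 'panelA');
    INSERT INTO varAnnotation VALUES (1, 10, 'c.1A>G', 'p.Met1Val', 'GENE1', 'T1');
    INSERT INTO varScore VALUES (10, 'T1', 3, 'alice', 'bob');
""")

# The same view logic written with explicit JOIN ... ON clauses.
EXPLICIT_JOIN_SQL = """
    SELECT vc.projectID, vc.sampleID, vc.runID, vc.ngsSubpanel,
           va.HGVS_c, va.HGVS_p, va.geneID, va.transcriptID,
           sc.score, sc.scorer, sc.checker
    FROM varCall AS vc
    INNER JOIN varAnnotation AS va
        ON vc.varCallID = va.varCallID AND vc.variantID = va.variantID
    INNER JOIN varScore AS sc
        ON va.variantID = sc.variantID AND va.transcriptID = sc.transcriptID
"""

rows = conn.execute(EXPLICIT_JOIN_SQL).fetchall()
print(rows)
```

On SQL Server itself, comparing the execution plan before and after adding the indexes would show whether they are actually used; this sketch only illustrates the shape of the change and says nothing about timings.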

Related

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MsSQL database, with pyodbc as the db backend (using "django-pyodbc-azure" module).
I have the following models:
from django.db import models

class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So using the exclude at the database level doesn't seem to affect query performance, but in Django the query runs 6 times slower if I add the exclude part. What could be causing this?
The general issue seems to be that django is doing some extra work to prepare the exclude clause. After that step and by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet then django might be executing the badDevices query just to prepare the actual query's SQL. Possibly something similar might be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is, of course, django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can be at least eliminated.
EDIT: I think it must be the Device.uid field. Possibly django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
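As a hedged variation on the raw WHERE clause above, the ids can be passed as query parameters instead of being interpolated into the SQL string, which avoids quoting problems. The helper name `build_not_in_clause` is hypothetical; the sketch relies only on the fact that Django's `extra()` accepts a `params` list matched to `%s` placeholders in `where`.

```python
def build_not_in_clause(column, values):
    """Return (where_sql, params) for a parameterized NOT IN clause.

    `column` must be a trusted identifier; only `values` are sent as parameters.
    """
    values = list(values)
    if not values:
        # "NOT IN ()" is invalid SQL; an always-true clause excludes nothing.
        return "1 = 1", []
    placeholders = ", ".join(["%s"] * len(values))
    return "%s NOT IN (%s)" % (column, placeholders), values


where_sql, params = build_not_in_clause("device_id", ["id-1", "id-2", "id-3"])
print(where_sql)  # device_id NOT IN (%s, %s, %s)
print(params)     # ['id-1', 'id-2', 'id-3']
```

With the models from the question, this would be used as `Session.objects.filter(branch=branch).extra(where=[where_sql], params=params)`.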
Just want to share a similar problem that I had with MySQL and exclude clauses performance and how it was fixed.
When running the exclude clause, the list passed to the "in" lookup was actually a QuerySet that I got using the values_list method. Checking the exclude query executed by MySQL, the "in" objects were not values but another query. This behavior was hurting performance on specific large queries.
To fix that, instead of passing the queryset, I flattened it into a Python list of values. That way, each value is passed as an argument inside the "in" lookup, and performance improved considerably.

Entity Framework: Max. number of "subqueries"?

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name),
                   TaskDates = x.Tasks.Select(y => y.Date),
                   DateInfos = x.Dates.Select(y => y.Info)
               }).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the max. number of these "subqueries" for a single statement? Is there a way to work around this issue without having to make more than one call to the database?
(ps: I'm not entirely sure, but I seem to remember that this query worked in LINQ2SQL - could it be?)
UPDATE
I'm getting crazy about this. I tried to repro the issue from ground up using a fresh, simple project (to post the entire piece of code here, not only an oversimplified example) - and I found I wasn't able to repro it. It still happens within our existing code base (apparently there's more behind this problem, but I cannot share this closed code base, unfortunately).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independent of the above) when I don't use a .Take() to only get the first X rows
It also works great when I add an additional .Where() statement to cut the number of rows returned from SQL Server
I didn't find any comprehensible reason why I see this behavior, but I started to look at the SQL: Although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows on a specific index in the middle of the execution plan, which curiously ends in fewer rows returned for the entire SQL statement, which in turn results in the missing data that prompted my question to begin with.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (just don't use READ UNCOMMITTED), but I have no idea why it occurs and if it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least: As far as I can see, it's not an issue with EF itself.
A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
               select new {
                   PersonId = x.Id,
                   JobNames = x.Jobs.Select(y => y.Name).ToList(),
                   TaskDates = x.Tasks.Select(y => y.Date).ToList(),
                   DateInfos = x.Dates.Select(y => y.Info).ToList()
               }).ToList();

Query Performance in Access 2007 - drawing on a SQL Server Express backend

I've been banging my head on this issue for a little while now, and decided I should ask for help. I have a table which holds temperature/humidity chart recorder data (currently over 775,000 records), and I am trying to run a statistical query against it. The problem is that the query often takes up to two minutes, and sometimes does not come back at all, forcing me to close the program (Control-Alt-Delete). At first I didn't have much of a problem; it was only after I hit the magical 500k-record mark that I started getting serious slowdowns, which got progressively worse as more data was compiled and imported into the table.
Here is the query (pass-through):
SELECT dbo.tblRecorderLogs.strAreaAssigned, Min(dbo.tblRecorderLogs.datDateRecorded) AS FirstRecorderDate, Max(dbo.tblRecorderLogs.datDateRecorded) AS LastRecordedDate,
Round(Avg(dbo.tblRecorderLogs.intTempCelsius),2) AS AverageTempC,
Round(Avg(dbo.tblRecorderLogs.intRHRecorded),2) AS AverageRH,
Count(dbo.tblRecorderLogs.strAreaAssigned) AS Records
FROM dbo.tblRecorderLogs
GROUP BY dbo.tblRecorderLogs.strAreaAssigned
ORDER BY dbo.tblRecorderLogs.strAreaAssigned;
Here is the table structure in which the chart data is stored:
idRecorderDataID     Number      Primary Key
datDateEntered       Date/Time   (indexed, duplicates OK)
datTimeEntered       Date/Time
intTempCelcius       Number
intDewPointCelcius   Number
intWetBulbCelcius    Number
intMixingGPP         Number
intRHRecorded        Number
strAssetRecorder     Text        (indexed, duplicates OK)
strAreaAssigned      Text        (indexed, duplicates OK)
I am trying to write a program which will allow people to pull data from this table based on Area Assigned, as well as start and end dates. With the dataset size I currently have, this kind of report is simply too much for it to handle (it seems) and the machine doesn't ever return an answer. I've had to extend the ODBC timeout to almost 180 seconds in any queries dealing with this table, simply because of the size. I could use some serious help, if people have some. Thank you in advance!
-- Edited 08/13/2012 @ 1050 hours --
I have not been able to test the query on the SQL Server due to the fact that the IT department has taken control of the machine in question, and has someone logged into it full-time using the remote management console. I have tried an interim step to lessen the impact of the performance issue, but I am still looking for a permanent solution to this issue.
Interim step:
I created a local table mirroring the structure of the dbo.tblRecorderLogs SQL Server table, into which I do an INSERT INTO using the former SELECT statement as its subquery. Then any subsequent statistical analysis is drawn from this 'temporary' local table. After the process is complete, the local table is truncated.
-- Edited 08/13/2012 @ 1217 hours --
Ran the shown query on the SQL Server Management Console, took 1 minute 38 seconds to complete according to the query timer provided by the console.
-- Edit 08/15/2012 @ 1531 hours --
Tried to run query as VBA DoCmd.RunSQL statement to populate a temporary table using the following code:
INSERT INTO tblTempRecorderDataStatsByArea ( strAreaAssigned, datFirstRecord,
datLastRecord, intAveTempC, intAveRH, intRecordCount )
SELECT dbo_tblRecorderLogs.strAreaAssigned, Min(dbo_tblRecorderLogs.datDateRecorded)
AS MinOfdatDateRecorded, Max(dbo_tblRecorderLogs.datDateRecorded) AS MaxOfdatDateRecorded,
Round(Avg(dbo_tblRecorderLogs.intTempCelsius),2) AS AveTempC,
Round(Avg(dbo_tblRecorderLogs.intRHRecorded),2) AS AveRHRecorded,
Count(dbo_tblRecorderLogs.strAreaAssigned) AS CountOfstrAreaAssigned FROM
dbo_tblRecorderLogs GROUP BY dbo_tblRecorderLogs.strAreaAssigned ORDER BY
dbo_tblRecorderLogs.strAreaAssigned
The problem arises when the code is executed, the query takes so long - it encounters Timeout before it finishes. Still hoping for a 'magic bullet' to fix this...
-- Edited 08/20/2012 @ 1241 hours --
The only 'quasi' solution I've found is running the failed query repeatedly (sort of priming the pump, as it were) so that when the query is called again by my program - it has a relative chance of actually completing before the ODBC SQL Server driver times out. Basically, a filthy filthy hack - but I don't have a better one to combat this issue.
I've tried creating a view, which works on the server side - but doesn't speed things up.
The proper fields being aggregated are indexed properly, so I can't make any changes there.
I am only pulling information from the database that is immediately useful to the user - no 'SELECT * madness' going on here.
I think I am, officially, out of things to try - aside from throwing raw computing horsepower at the problem, which isn't a solution right now as the item isn't live, and I have no budget to procure better hardware. I will post this as an 'answer' and leave it up until Sept 3rd - where if I do not have better answers, I will accept my own answer and accept defeat.
When I've had to run min/max functions on several fields from the same table I've often found it quicker to do each column separately as a subquery in the from line of the main/outer query.
So your query would be like this:
SELECT rLogs1.strAreaAssigned, rLogs1.FirstRecorderDate, rLogs2.LastRecordedDate, rLogs3.AverageTempC, rLogs4.AverageRH, rLogs5.Records
FROM (((
(SELECT strAreaAssigned, Min(datDateRecorded) AS FirstRecorderDate FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs1
INNER JOIN
(SELECT strAreaAssigned, Max(datDateRecorded) AS LastRecordedDate FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs2
ON rLogs1.strAreaAssigned = rLogs2.strAreaAssigned)
INNER JOIN
(SELECT strAreaAssigned, Round(Avg(intTempCelsius),2) AS AverageTempC FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs3
ON rLogs1.strAreaAssigned = rLogs3.strAreaAssigned)
INNER JOIN
(SELECT strAreaAssigned, Round(Avg(intRHRecorded),2) AS AverageRH FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs4
ON rLogs1.strAreaAssigned = rLogs4.strAreaAssigned)
INNER JOIN
(SELECT strAreaAssigned, Count(strAreaAssigned) AS Records FROM dbo.tblRecorderLogs GROUP BY strAreaAssigned) rLogs5
ON rLogs1.strAreaAssigned = rLogs5.strAreaAssigned
ORDER BY rLogs1.strAreaAssigned;
If you take your query and the one above, copy them into the same query window in SQL Server and run the estimated execution plan you should be able to compare them and see which one works better.

Hitting the 2100 parameter limit (SQL Server) when using Contains()

from f in CUSTOMERS
where depts.Contains(f.DEPT_ID)
select f.NAME
depts is a list (IEnumerable<int>) of department ids
This query works fine until you pass a large list (say around 3000 dept ids) .. then I get this error:
The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.
I changed my query to:
var dept_ids = string.Join(" ", depts.ToStringArray());
from f in CUSTOMERS
where dept_ids.IndexOf(Convert.ToString(f.DEPT_id)) != -1
select f.NAME
using IndexOf() fixed the error but made the query slow. Is there any other way to solve this? thanks so much.
My solution (Guids is a list of ids you would like to filter by):
List<MyTestEntity> result = new List<MyTestEntity>();
for (int i = 0; i < Math.Ceiling((double)Guids.Count / 2000); i++)
{
    var nextGuids = Guids.Skip(i * 2000).Take(2000);
    result.AddRange(db.Tests.Where(x => nextGuids.Contains(x.Id)));
}
this.DataContext = result;
Why not write the query in sql and attach your entity?
It's been a while since I worked in Linq, but here goes:
IQuery q = Session.CreateQuery(@"
    select *
    from customerTable f
    where f.DEPT_id in (" + string.Join(",", depts.ToStringArray()) + ")");
q.AttachEntity(CUSTOMER);
Of course, you will need to protect against injection, but that shouldn't be too hard.
You will want to check out the LINQKit project, since somewhere in there is a technique for batching up such statements to solve this issue. I believe the idea is to use the PredicateBuilder to break the local collection into smaller chunks, but I haven't reviewed the solution in detail because I've instead been looking for a more natural way to handle this.
Unfortunately it appears from Microsoft's response to my suggestion to fix this behavior that there are no plans set to have this addressed for .NET Framework 4.0 or even subsequent service packs.
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=475984
UPDATE:
I've opened up some discussion regarding whether this was going to be fixed for LINQ to SQL or the ADO.NET Entity Framework on the MSDN forums. Please see these posts for more information regarding these topics and to see the temporary workaround that I've come up with using XML and a SQL UDF.
I had a similar problem and found two ways to fix it:
the Intersect method
a join on the IDs
To get values that are NOT in the list, I used the Except method or a left join.
Update
EntityFramework 6.2 runs the following query successfully:
var employeeIDs = Enumerable.Range(3, 5000);
var orders =
from order in Orders
where employeeIDs.Contains((int)order.EmployeeID)
select order;
Your post was from a while ago, but perhaps someone will benefit from this. Entity Framework does a lot of query caching: every time you send in a different parameter count, another entry gets added to the cache. Using a "Contains" call will cause SQL to be generated with a clause like "WHERE x IN (@p1, @p2, ... @pn)", and bloat the EF cache.
Recently I looked for a new way to handle this, and I found that you can create an entire table of data as a parameter. Here's how to do it:
First, you'll need to create a custom table type, so run this in SQL Server (in my case I called the custom type "TableId"):
CREATE TYPE [dbo].[TableId] AS TABLE(
    Id [int] PRIMARY KEY
)
Then, in C#, you can create a DataTable and load it into a structured parameter that matches the type. You can add as many data rows as you want:
DataTable dt = new DataTable();
dt.Columns.Add("id", typeof(int));
This is an arbitrary list of IDs to search on. You can make the list as large as you want:
dt.Rows.Add(24262);
dt.Rows.Add(24267);
dt.Rows.Add(24264);
Create an SqlParameter using the custom table type and your data table:
SqlParameter tableParameter = new SqlParameter("@id", SqlDbType.Structured);
tableParameter.TypeName = "dbo.TableId";
tableParameter.Value = dt;
Then you can call a bit of SQL from your context that joins your existing table to the values from your table parameter. This will give you all records that match your ID list:
var items = context.Dailies.FromSqlRaw<Dailies>("SELECT * FROM dbo.Dailies d INNER JOIN @id id ON d.Daily_ID = id.id", tableParameter).AsNoTracking().ToList();
You could always partition your list of depts into smaller sets before you pass them as parameters to the IN statement generated by Linq. See here:
Divide a large IEnumerable into smaller IEnumerable of a fix amount of item
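The partitioning idea is language-agnostic; a minimal sketch in Python follows. The helper name `chunked` and the sample ids are hypothetical, and the batch size of 2000 is simply borrowed from the earlier answer as a value safely below the 2100-parameter limit.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 3000 hypothetical department ids, split to stay under the 2100-parameter limit.
dept_ids = range(3000)
batches = list(chunked(dept_ids, 2000))
print([len(b) for b in batches])  # [2000, 1000]
```

Each batch would then be sent as its own IN (...) query and the results concatenated, at the cost of one round trip per batch.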

How do I search a "Property Bag" table in SQL?

I have a basic "property bag" table that stores attributes about my primary table "Card." So when I want to start doing some advanced searching for cards, I can do something like this:
SELECT dbo.Card.Id, dbo.Card.Name
FROM dbo.Card
INNER JOIN dbo.CardProperty ON dbo.CardProperty.IdCrd = dbo.Card.Id
WHERE dbo.CardProperty.IdPrp = 3 AND dbo.CardProperty.Value = 'Fiend'
INTERSECT
SELECT dbo.Card.Id, dbo.Card.Name
FROM dbo.Card
INNER JOIN dbo.CardProperty ON dbo.CardProperty.IdCrd = dbo.Card.Id
WHERE (dbo.CardProperty.IdPrp = 10 AND (dbo.CardProperty.Value = 'Wind' OR dbo.CardProperty.Value = 'Fire'))
What I need to do is to extract this idea into some kind of stored procedure, so that ideally I can pass in a list of property/value combinations and get the results of the search.
Initially this is going to be a "strict" search meaning that the results must match all elements in the query, but I'd also like to have a "loose" query so that it would match any of the results in the query.
I can't quite seem to wrap my head around this one. My previous version of this generated a massive SQL query with a lot of AND/OR clauses in it, but I'm hoping to do something a little more elegant this time. How do I go about doing this?
It seems to me that you have an EAV (entity-attribute-value) model here.
If you're using SQL Server 2005 or later, I'd suggest using the XML data type for this:
http://weblogs.sqlteam.com/mladenp/archive/2006/10/14/14032.aspx
It makes searching much easier thanks to the built-in XML querying capabilities.
If you can't change your model, then look at this:
http://weblogs.sqlteam.com/davidm/articles/12117.aspx
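If the model has to stay as it is, one common alternative (not taken from the linked articles) is to load the requested property/value pairs into a criteria table and join against it. A "loose" search keeps any card with at least one matching pair; a "strict" search keeps only cards that match every requested property. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server; the Card and CardProperty names come from the question, while the `Criteria` temp table, the `search_cards` function, and the sample rows are hypothetical.

```python
import sqlite3


def search_cards(conn, criteria, strict=True):
    """Return (Id, Name) rows for cards matching (property id, value) pairs.

    strict=True: a card must match at least one value for every property id.
    strict=False: a card must match at least one pair.
    `criteria` is a list of (IdPrp, Value) tuples.
    """
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS Criteria (IdPrp INTEGER, Value TEXT)")
    conn.execute("DELETE FROM Criteria")
    conn.executemany("INSERT INTO Criteria VALUES (?, ?)", criteria)
    # Number of distinct properties a card has to satisfy.
    required = len({prop_id for prop_id, _ in criteria}) if strict else 1
    sql = """
        SELECT c.Id, c.Name
        FROM Card AS c
        JOIN CardProperty AS cp ON cp.IdCrd = c.Id
        JOIN Criteria AS q ON q.IdPrp = cp.IdPrp AND q.Value = cp.Value
        GROUP BY c.Id, c.Name
        HAVING COUNT(DISTINCT cp.IdPrp) >= ?
        ORDER BY c.Id
    """
    return conn.execute(sql, (required,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Card (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE CardProperty (IdCrd INTEGER, IdPrp INTEGER, Value TEXT);
    INSERT INTO Card VALUES (1, 'Card A'), (2, 'Card B'), (3, 'Card C');
    INSERT INTO CardProperty VALUES
        (1, 3, 'Fiend'), (1, 10, 'Wind'),
        (2, 3, 'Fiend'), (2, 10, 'Water'),
        (3, 3, 'Dragon'), (3, 10, 'Fire');
""")

# Same shape as the question: property 3 = 'Fiend' and property 10 in ('Wind', 'Fire').
criteria = [(3, "Fiend"), (10, "Wind"), (10, "Fire")]
strict_rows = search_cards(conn, criteria, strict=True)
loose_rows = search_cards(conn, criteria, strict=False)
print(strict_rows)  # [(1, 'Card A')]
print(loose_rows)   # [(1, 'Card A'), (2, 'Card B'), (3, 'Card C')]
```

In a stored procedure the criteria rows would have to be supplied some other way, for example as XML on SQL Server 2005 or as a table-valued parameter on later versions; that part is left open here.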
