SQL Server table indexing for specific LINQ for SQL query - sql-server

I'll be describing the business case first. If you just want the question, please skip a couple of paragraphs ahead...
I'm synchronizing data on a .NET mobile client from an ASP.NET Web API server over the Web. Due to the "mobile nature" of the client, I'd like the process to be as efficient as possible, so I'd like to implement incremental synchronization, meaning that the client asks for new entries from a specified date, which will usually be the last sync date.
I'm dealing with entry deletions separately, so for the sake of simplicity, let's focus on new and modified entries.
The table being synchronized is too large to fit in a single response, so paging is implemented.
Each entry in the table has a unique ID column and a LastUpdated column. On the server, I'm using the following code to respond with the requested page:
var set = Model.Set<T>().Where(t => String.Compare(t.LastUpdated, fromDate, StringComparison.Ordinal) >= 0).OrderBy(t => t.Id);
var queryResultPage = set.Skip(pageSize * pageNumber).Take(pageSize);
return queryResultPage.ToList();
Model.Set is the DbSet from which data is retrieved. Please ignore the fact that I must use strings to represent dates...
My question is, what SQL Server table index(es) would produce optimal performance for this case?

Pleun is exactly right. I did a demo of this for a client recently on the CRM 2011 platform. I showed them a case where a page view was taking ~30 seconds to load a page after sorting through 2.2M records plus an additional 4.5M records.
Using SQL Profiler, you can find the query being run.
Put it into SQL Server Management Studio (clean it up as necessary to make it standard SQL).
Then display the execution plan and look for the indexes it suggests (especially the ones it reports as missing).
Anyway, in my demo to my client, after we finished with this, the query dropped down to less than a second; and the page loaded in about 4 seconds (which is still pitiful).
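As a starting point for the steps above, the missing-index DMVs expose the same suggestions that the execution plan shows. The sketch below assumes a hypothetical table dbo.Entries with the Id and LastUpdated columns from the question; the table name, the index name and the key order are assumptions to validate against the actual plan, not a definitive answer.

```sql
-- Hypothetical table and index names; validate against the actual execution plan.
-- Candidate index: seek on LastUpdated >= @fromDate, with Id carried in the key
-- so the filter and sort columns are both available from the index.
CREATE NONCLUSTERED INDEX IX_Entries_LastUpdated_Id
    ON dbo.Entries (LastUpdated, Id);

-- List the missing-index suggestions SQL Server has recorded since the last restart.
SELECT  d.[statement]        AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_user_impact
FROM    sys.dm_db_missing_index_details     AS d
JOIN    sys.dm_db_missing_index_groups      AS g ON g.index_handle = d.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact * s.user_seeks DESC;
```

With a range predicate on LastUpdated and ORDER BY Id, a sort is still likely, because the index returns rows in LastUpdated order. Paging by (LastUpdated, Id) instead would let an index like this supply the order directly.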

Related

Ghost data rows added into Firebird database table?

Today I faced a strange case when receiving a customer database for investigation.
System settings:
Firebird server v 2.5.9.26074
Firebird client v 2.6.5
Database file is accessed directly by the application, i.e., it is NOT registered via aliases.conf.
When I first looked into the database, everything seemed to be pretty consistent. However, during the first startup two rows are added to a certain table without any detected SQL execution. I have confirmed with a debugger that the application is not adding these rows. I also used the Audit and Trace interface (fbtracemgr) and saw in the log file that no such rows are added to the database.
There is one hint that something is wrong in the original database. The table that contains the problem uses an INSERT trigger to set the row's ID column value from a generator. The generator value seems to be one too high in the original database. This leads me to think that the "ghost data" has already been entered in the file in some sort of cache, as the generator has already been incremented by one.
The result is that after these two ghost rows are added, the next real addition to the table leads to an exception:
FirebirdSql.Data.FirebirdClient.FbException (0x80004005): violation of
PRIMARY or UNIQUE KEY constraint "INTEG_275" on table "DATALOG" --->
violation of PRIMARY or UNIQUE KEY constraint "INTEG_275" on table
"DATALOG"
as a row already exists with the same ID that the generator suggests.
Is there a persistent "unsaved data cache" that could contain row data entered during previous application runs? What could lead to this situation? A power failure during a database write or backup?
Any thoughts?
Firebird server v 2.5.9.26074
There is no such version released.
Firebird-2.5.8.27089
http://www.firebirdsql.org/en/firebird-2-5/
Basically you seem to be using some unstable internal build from the Firebird developers, which can have any number of strange adverse effects.
So I would advise using a standard released version or, if using snapshot builds is required for some untold reason, asking the developers on the firebird-support mailing list - http://www.firebirdsql.org/en/support/
Though don't hold your breath for much support with exotic Firebird builds.
UPD. Thanks to Mark, here it is: https://www.firebirdsql.org/en/firebird-2-5-0/
2.5.0 was the first release after a significant reworking of the engine. Not the most stable, obviously. For example, there was an issue with indices right in the next version, 2.5.1.
If the behavior is repeated on standard Firebird 2.5.8, then I would suggest exporting the whole database (at least all the metadata, but maybe the data as well) into a long text file, an SQL script, and then searching it for the said table name. For example, there might be on-database-connect triggers adding some data. Or stored procedures. Or views with triggers. Or something else yet. For example, though it is malpractice, even a UDF may make its own database connection and do things, though this should show up in FBTrace.
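Before exporting everything, the system tables can narrow the search. The queries below are a minimal sketch against the Firebird 2.5 system tables; DATALOG is the table name taken from the error message in the question.

```sql
-- List user-defined triggers. Database-level triggers (ON CONNECT and similar)
-- are the rows where RDB$RELATION_NAME is NULL.
SELECT  t.RDB$TRIGGER_NAME,
        t.RDB$RELATION_NAME,
        t.RDB$TRIGGER_TYPE,
        t.RDB$TRIGGER_INACTIVE
FROM    RDB$TRIGGERS t
WHERE   COALESCE(t.RDB$SYSTEM_FLAG, 0) = 0
ORDER BY t.RDB$RELATION_NAME, t.RDB$TRIGGER_SEQUENCE;

-- List objects (triggers, procedures, views) that depend on the DATALOG table.
SELECT DISTINCT d.RDB$DEPENDENT_NAME, d.RDB$DEPENDENT_TYPE
FROM    RDB$DEPENDENCIES d
WHERE   d.RDB$DEPENDED_ON_NAME = 'DATALOG';
```

Anything listed here that inserts into DATALOG is a candidate source of the extra rows. Code run from a UDF or an external connection would not appear in these tables.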
However, during the first startup there are two rows added in certain table
Startup of what?
Will those rows still be added if you use standard tools like iSQL/FlameRobin/IBExpert/etc. just to connect to and then disconnect from the database?
as there already exist row with equal ID that the generator suggests
A generator cannot suggest things like that. It only tells you that such a number was once reserved, possibly for insertion into one table or another. It does not mean that the row was actually inserted, that it was inserted into that table, or that it was not deleted later.
You may try searching with index use suppressed, in case the index is corrupted, something like
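To see how far the generator and the stored data actually diverge, the two values can be read side by side. This is a sketch only: GEN_DATALOG_ID is a hypothetical generator name, to be replaced with the one the INSERT trigger really calls.

```sql
-- GEN_DATALOG_ID is a hypothetical name; use the generator referenced by the trigger.
-- GEN_ID(..., 0) reads the current value without incrementing it.
SELECT GEN_ID(GEN_DATALOG_ID, 0) AS current_generator_value
FROM   RDB$DATABASE;

-- Highest ID actually stored in the table, for comparison.
SELECT MAX(ID) AS highest_stored_id, COUNT(*) AS total_rows
FROM   DATALOG;
```

A generator value above the highest stored ID is normal after rollbacks or deletions. A value below it is what produces the key violation on the next insert.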
select id+0, count(*) from tableName group by 1
Also http://www.firebirdfaq.org/faq324/
when receiving customer database for investigation
BTW, how exactly did they create the copy of the database they gave you?
Did they make a backup (FBK)? If not, did they stop the Firebird server before making copies?

Trying to query data from an enormous SQL Server table using EF 5

I am working with a SQL Server table that contains 80 million (80,000,000) rows. Data space = 198,000 MB. Not surprisingly, queries against this table often churn or time out. To add to the issues, the table rows get updated fairly frequently and new rows also get added on a regular basis. It thus continues to grow like a viral outbreak.
My issue is that I would like to write Entity Framework 5 LINQ to Entities queries to grab rows from this monster table. As I've tried, timeouts have become outright epidemic. A few more things: the table's primary key is indexed and it has non-clustered indexes on 4 of its 19 columns.
So far, I am writing simple LINQ queries that use Transaction Scope and Read Uncommitted Isolation Level. I have tried increasing both the command timeout and the connection timeout. I have written queries that return FirstOrDefault() or a collection, such as the following, which attempts to grab a single ID (an int) from seven days before the current date:
public int GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
    using (var txn = new TransactionScope(TransactionScopeOption.Required, new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
    {
        var GetId = from te in _repo.GetTEvents()
                    where te.cr_date > sevenDaysAgo
                    orderby te.cr_date
                    select te.id;
        return GetId.FirstOrDefault();
    }
}
and
public IEnumerable<int> GetIDForSevenDaysAgo(DateTime sevenDaysAgo)
{
    using (var txn = new TransactionScope(TransactionScopeOption.Required, new TransactionOptions { IsolationLevel = IsolationLevel.ReadUncommitted }))
    {
        var GetId = from te in _repo.GetTEvents()
                    where te.cr_date > sevenDaysAgo
                    orderby te.cr_date
                    select te.id;
        return GetId.Take(1);
    }
}
Each query times out repeatedly regardless of the timeout settings. I'm using the repository pattern with Unity DI and fetching the table with IQueryable<> calls. I'm also limiting the repository call to eight days from the current date (hoping to only grab the needed subset of this mammoth table). I'm using Visual Studio 2013 with Update 5 targeting .NET v4.5 and SQL Server 2008 R2.
I generated the SQL statement that EF generates and it didn't look incredibly more complicated than the LINQ statements above. And my brain hurts.
So, have I reached some sort of tolerance limit for EF? Is the table simply too big? Should I revert to Stored Procedures/domain methods when querying this table? Are there other options I should explore? There was some discussion around removing some of the table's rows, but that probably won't happen anytime soon. I did read a little about paging, but I'm not sure if that would help or not. Any thoughts or ideas would be appreciated! Thank you!
As far as I can see, you are only selecting data and not changing it. So why do you need a TransactionScope? You need it only when you have two or more SaveChanges() calls in your code and want them to run in one transaction. So get rid of it.
Another thing I would do in your case is disable change tracking and automatic change detection on your context. But be careful if you don't recreate your context on each request: it can keep stale data.
To do this, write these lines near your context initialization:
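// EF 5 DbContext settings; for read-only entity queries also consider .AsNoTracking()
context.Configuration.AutoDetectChangesEnabled = false;
context.Configuration.LazyLoadingEnabled = false;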
The other things you should think about are pagination and caching. But as far as I can see from your example, you are trying to get only one row, so I can't say anything specific.
I recommend reading this article for further optimisation.
It's not easy to say whether you have to go with stored procedures or EF, since we are talking about a monster. :-)
The first thing I would do is run the query in SSMS with the Actual Execution Plan displayed. Sometimes it reports missing indexes that might increase performance.
From your example, I'm pretty sure you need an index on that date column.
In other words, if you have access, make sure the table design is optimal for that amount of data.
My thought is that if a simple query hangs, what more can EF do?
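The index on the date column suggested above could look roughly like this. The table name dbo.TEvents is an assumption inferred from GetTEvents() in the question, and the index name is made up; whether one of the four existing non-clustered indexes already covers cr_date is not known from the question.

```sql
-- dbo.TEvents is a hypothetical table name; adjust it to the real one.
CREATE NONCLUSTERED INDEX IX_TEvents_cr_date
    ON dbo.TEvents (cr_date)
    INCLUDE (id);

-- Roughly the query the LINQ statement should translate to. With the index above
-- it can seek to the first qualifying cr_date and stop after one row.
DECLARE @sevenDaysAgo DATETIME;
SET @sevenDaysAgo = DATEADD(DAY, -7, GETDATE());

SELECT TOP (1) id
FROM   dbo.TEvents
WHERE  cr_date > @sevenDaysAgo
ORDER BY cr_date;
```

Running the SELECT in SSMS with the actual execution plan displayed shows whether it performs an index seek or still scans the table.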

Best performance approach to history mechanism?

We are going to create a history mechanism for changes in our DB (DART in pic) via triggers.
We have 600 tables.
For each record that is changed, the trigger will insert the deleted version into XXX.
Regarding XXX:
option 1 : clone each table in "Dart" DB and each table now will have a "sister table"
e.g. :
Table1 will have Table1_History
problems :
we will have 1200 tables
programmers can make mistakes by working on the wrong tables...
option 2 : make a new DB (DART_2005 in pic) and the history tables will be there
option 3 : use linked server which stores the Db which will contain the history tables.
question :
1) Which option gives the best performance? (I guess 3 does not, but is it 1, 2, or are they the same?)
2) Does option 2 act like a "linked server"? (In queries we will need to select from both DBs...)
3) What is the best practice approach ?
All three approaches are viable and have similar performance based on your network speed, but each one will cause you a lot of headaches on a system with many concurrent users.
Since you will be inserting into or updating multiple tables in one transaction with very different access patterns (the source table is random, the history table is sequential), you will end up with blocking and/or deadlocks.
If the existing table schema can not be changed
If you want to have a history system in place driven by your database ideally you will queue your history updates to prevent blocking problems.
Fire a trigger on update of your table
The trigger will submit a message containing the information from the inserted/deleted tables to a SQL Server Service Broker Queue
An activation stored procedure can pull the information from the queue and write it to the appropriate history table
On failure, a new message is sent to an "error queue" where a retry mechanism can re-submit to the original queue (make sure to include a retry counter in the message)
This way your history updates will be non-blocking and can not get lost.
Note: when working with SQL Server Service broker, make sure you completely understand the "Poison message" concept.
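The queueing steps above can be sketched as follows. Every object name is hypothetical, Table1 stands in for any of the 600 tables, and Service Broker is assumed to be enabled on the database. The activation procedure that reads the queue, the error queue and the ending of conversations are left out.

```sql
-- All object names below are hypothetical.
CREATE MESSAGE TYPE [//Dart/History/RowChanged] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//Dart/History/Contract]
    ([//Dart/History/RowChanged] SENT BY INITIATOR);

CREATE QUEUE dbo.HistorySenderQueue;
CREATE QUEUE dbo.HistoryQueue;

CREATE SERVICE [//Dart/History/Sender] ON QUEUE dbo.HistorySenderQueue;
CREATE SERVICE [//Dart/History/Writer] ON QUEUE dbo.HistoryQueue ([//Dart/History/Contract]);
GO

CREATE TRIGGER dbo.trg_Table1_QueueHistory ON dbo.Table1
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    IF NOT EXISTS (SELECT 1 FROM deleted) RETURN;

    DECLARE @handle UNIQUEIDENTIFIER;
    DECLARE @body XML;

    -- Serialise the pre-change rows; the activation procedure shreds this XML later.
    SET @body = (SELECT * FROM deleted FOR XML PATH('row'), ROOT('Table1'), TYPE);

    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [//Dart/History/Sender]
        TO SERVICE '//Dart/History/Writer'
        ON CONTRACT [//Dart/History/Contract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle
        MESSAGE TYPE [//Dart/History/RowChanged] (@body);
END;
GO
```

The trigger only enqueues a message, so the writing transaction does not touch the history table. A production version would also reuse or end conversations rather than open a new dialog on every change.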
If the existing table schema can be changed
When this is an option, I recommend working with a "record versioning" system where every update creates a new record and your application queries the most recent version of the data. To ensure proper performance, the table can be partitioned to keep the most recent version of the data in one partition and the older versions in an archive partition. (I usually have a field end_date or expiration_date which is set to 9999/12/31 for the currently valid record.)
This approach of course requires considerable code changes in your data model and the existing application which might be not very cost effective.
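A minimal sketch of the record-versioning idea, assuming a hypothetical table with a single data column; partitioning is left out, and the table, constraint and column names are illustrative only.

```sql
-- Hypothetical table; '99991231' marks the currently valid version, as described above.
CREATE TABLE dbo.Table1_Versioned (
    id         INT           NOT NULL,
    start_date DATETIME      NOT NULL,
    end_date   DATETIME      NOT NULL CONSTRAINT DF_Table1_Versioned_end DEFAULT ('99991231'),
    name       NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Table1_Versioned PRIMARY KEY (id, start_date)
);
GO

-- An "update" closes the current version and inserts a new one in one transaction.
DECLARE @id INT, @now DATETIME, @newName NVARCHAR(100);
SET @id = 42;
SET @now = GETDATE();
SET @newName = N'new value';

BEGIN TRANSACTION;
    UPDATE dbo.Table1_Versioned
    SET    end_date = @now
    WHERE  id = @id AND end_date = '99991231';

    INSERT INTO dbo.Table1_Versioned (id, start_date, name)
    VALUES (@id, @now, @newName);
COMMIT TRANSACTION;

-- Readers ask only for the current version.
SELECT id, name
FROM   dbo.Table1_Versioned
WHERE  end_date = '99991231';
```

Every query in the application has to carry the end_date filter, which is where the considerable code changes mentioned above come from.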
1 and 2 will have similar performance; option 3 might be faster, if you are currently limited by some resource on the database server (e.g. disk IO), and you have a very fast network available to you.
Option 1 will lead to longer back-up times for your DART database - this may be a concern.
In general, I believe that if your application domain needs the concept of "history", you should build it in as a first-class feature. There are several approaches to this - check out the links in question How to create a point in time architecture in MySQL.
Again, in general, I dislike the use of triggers for this kind of requirement. Your trigger either has to be very simple - in which case it's not always easy to use the data it creates in your history table - or it has to be smart, in which case your trigger does a lot of work, which may make evolving your database schema harder in future.

Creating an index on a view with OpenQuery

SQL Server doesn't allow creating a view with schema binding where the view query uses OpenQuery as shown below.
Is there a way or a work-around to create an index on such a view?
The best you could do would be to schedule a periodic export of the AD data you are interested in to a table.
The table could of course then have all the indexes you like. If you ran the export every 10 minutes and the possibility of getting data that is 9 minutes and 59 seconds out of date is not a problem, then your queries will be lightning fast.
The only part of concern would be managing locking and concurrency during the export time. One strategy might be to export the data into a new table and then through renames swap it into place. Another might be to use SYNONYMs (SQL 2005 and up) to do something similar where you just point the SYNONYM to two alternating tables.
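The synonym variant could be sketched like this. All names are assumptions: a linked server called ADSI, two alternating tables dbo.AdUsers_A and dbo.AdUsers_B, a synonym dbo.AdUsers that all queries read from, and a placeholder LDAP path.

```sql
-- Hypothetical names throughout. Assume the synonym currently points at
-- dbo.AdUsers_A, so dbo.AdUsers_B is idle and can be reloaded freely.
TRUNCATE TABLE dbo.AdUsers_B;

INSERT INTO dbo.AdUsers_B (sAMAccountName, displayName)
SELECT sAMAccountName, displayName
FROM   OPENQUERY(ADSI,
       'SELECT displayName, sAMAccountName
        FROM ''LDAP://DC=example,DC=com''
        WHERE objectCategory = ''person'' AND objectClass = ''user''');

-- Repoint the synonym; the swap is short compared with the export itself.
BEGIN TRANSACTION;
    DROP SYNONYM dbo.AdUsers;
    CREATE SYNONYM dbo.AdUsers FOR dbo.AdUsers_B;
COMMIT TRANSACTION;
```

The next scheduled run does the same with the roles of the two tables reversed. Both tables can carry whatever indexes the queries need.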
The data that supplies the query you're performing comes from a completely different system outside of SQL Server. There's no way that SQL Server can create an indexed view on data it does not own. For starters, how would it be notified when something had been changed so it could update its indexes? There would have to be some notification and update mechanism, which is implausible because SQL Server could not reasonably maintain ACID for such a distributed, slow, non-SQL server transaction to an outside system.
Thus my suggestion for mimicking such a thing through your own scheduled jobs that refresh the data every X minutes.
--Responding to your comment--
You can't tell whether a new user has been added without querying. If Active Directory supports some API that generates events, I've never heard of it.
But, each time you query, you could store the greatest creation time of all the users in a table, then through dynamic SQL, query only for new users with a creation date after that. This query should theoretically be very fast as it would pull very little data across the wire. You would just have to look into what the exact AD field would be for the creation date of the user and the syntax for conditions on that field.
If managing the dynamic SQL was too tough, a very simple vbscript, VB, or .Net application could also query active directory for you on a schedule and update the database.
Here are the basics of indexed views and their requirements. Note that what you are trying to do would probably fall into the category of a derived table, so it is not possible to create an indexed view using "OpenQuery".
This list is from http://www.sqlteam.com/article/indexed-views-in-sql-server-2000
1. View definition must always return the same results from the same underlying data.
2. Views cannot use non-deterministic functions.
3. The first index on a View must be a clustered, UNIQUE index.
4. If you use Group By, you must include the new COUNT_BIG(*) in the select list.
5. View definition cannot contain the following:
a. TOP
b. Text, ntext or image columns
c. DISTINCT
d. MIN, MAX, COUNT, STDEV, VARIANCE, AVG
e. SUM on a nullable expression
f. A derived table
g. Rowset function
h. Another view
i. UNION
j. Subqueries, outer joins, self joins
k. Full-text predicates like CONTAINS or FREETEXT
l. COMPUTE or COMPUTE BY
m. ORDER BY
In this case, there is no way for SQL Server to know of any changes (data, schema, whatever) in the remote data source. For a local table, it can use SCHEMABINDING etc to ensure the underlying tables(s) stay the same and it can track datachanges.
If you need to query the view often, then I'd use a local table that is refreshed periodically. In fact, I'd use a table anyway. AD queries aren't the quickest at the best of times...

Have you ever encountered a query that SQL Server could not execute because it referenced too many tables?

Have you ever seen any of these error messages?
-- SQL Server 2000
Could not allocate ancillary table for view or function resolution.
The maximum number of tables in a query (256) was exceeded.
-- SQL Server 2005
Too many table names in the query. The maximum allowable is 256.
If yes, what have you done?
Given up? Convinced the customer to simplify their demands? Denormalized the database?
@(everyone wanting me to post the query):
I'm not sure if I can paste 70 kilobytes of code in the answer editing window.
Even if I can, this won't help, since this 70 kilobytes of code will reference 20 or 30 views that I would also have to post; otherwise the code will be meaningless.
I don't want to sound like I am boasting here but the problem is not in the queries. The queries are optimal (or at least almost optimal). I have spent countless hours optimizing them, looking for every single column and every single table that can be removed. Imagine a report that has 200 or 300 columns that has to be filled with a single SELECT statement (because that's how it was designed a few years ago when it was still a small report).
For SQL Server 2005, I'd recommend using table variables and partially building the data as you go.
To do this, create a table variable that represents your final result set you want to send to the user.
Then find your primary table (say the orders table in your example above) and pull that data, plus a bit of supplementary data that is only, say, one join away (customer name, product name). You can use INSERT ... SELECT to put this straight into your table variable (SELECT INTO cannot target a table variable).
From there, iterate through the table and for each row, do a bunch of small SELECT queries that retrieves all the supplemental data you need for your result set. Insert these into each column as you go.
Once complete, you can then do a simple SELECT * from your table variable and return this result set to the user.
I don't have any hard numbers for this, but there have been three distinct instances that I have worked on to date where doing these smaller queries has actually worked faster than doing one massive select query with a bunch of joins.
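The staged approach can be sketched as below. The Orders, Customers and Products tables and their columns are hypothetical stand-ins for the report's real tables, and the supplemental columns are filled with set-based UPDATE statements rather than a row-by-row loop, which is a variation on the steps described above.

```sql
-- Hypothetical schema standing in for the report's tables.
DECLARE @report TABLE (
    order_id      INT           NOT NULL PRIMARY KEY,
    customer_id   INT           NOT NULL,
    product_id    INT           NOT NULL,
    customer_name NVARCHAR(100) NULL,
    product_name  NVARCHAR(100) NULL
);

-- Step 1: load the primary table into the table variable.
INSERT INTO @report (order_id, customer_id, product_id)
SELECT o.order_id, o.customer_id, o.product_id
FROM   dbo.Orders AS o;

-- Steps 2..n: fill in supplemental columns with small, separate statements,
-- each touching only a few tables.
UPDATE r
SET    customer_name = c.name
FROM   @report AS r
JOIN   dbo.Customers AS c ON c.customer_id = r.customer_id;

UPDATE r
SET    product_name = p.name
FROM   @report AS r
JOIN   dbo.Products AS p ON p.product_id = r.product_id;

-- Final step: return the assembled result set.
SELECT * FROM @report;
```

Each statement references only a handful of tables, so no single query comes near the 256-table limit.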
@chopeen You could change the way you're calculating these statistics, and instead keep a separate table of all per-product stats. When an order is placed, loop through the products and update the appropriate records in the stats table. This would shift a lot of the calculation load to the checkout page rather than running everything in one huge query when running a report. Of course there are some stats that aren't going to work as well this way, e.g. tracking customers' next purchases after purchasing a particular product.
This would happen all the time when writing Reporting Services Reports for Dynamics CRM installations running on SQL Server 2000. CRM has a nicely normalised data schema which results in a lot of joins. There's actually a hotfix around that will up the limit from 256 to a whopping 260: http://support.microsoft.com/kb/818406 (we always thought this a great joke on the part of the SQL Server team).
The solution, as Dillie-O alludes to, is to identify appropriate "sub-joins" (preferably ones that are used multiple times) and factor them out into temp-table variables that you then use in your main joins. It's a major PIA and often kills performance. I'm sorry for you.
@Kevin, love that tee -- says it all :-).
I have never come across this kind of situation, and to be honest the idea of referencing > 256 tables in a query fills me with a mortal dread.
Your first question should probably be "Why so many?", closely followed by "What bits of information do I NOT need?" I'd be worried that the amount of data being returned from such a query would begin to impact performance of the application quite severely, too.
I'd like to see that query, but I imagine it's some problem with some sort of iterator, and while I can't think of any situations where it's possible, I bet it's from a bad while/case/cursor or a ton of poorly implemented views.
Post the query :D
Also, I feel like one of the possible problems could be having a ton (read 200+) of name/value tables which could be condensed into a single lookup table.
I had this same problem... my development box runs SQL Server 2008 (the view worked fine) but on production (with SQL Server 2005) the view didn't. I ended up creating views to avoid this limitation, using the new views as part of the query in the view that threw the error.
Kind of silly considering the logical execution is the same...
Had the same issue in SQL Server 2005 (worked in 2008) when I wanted to create a view. I resolved the issue by creating a stored procedure instead of a view.
