The setting 'auto create statistics' causes wildcard TEXT field searches to hang - sql-server

I have an interesting issue happening in Microsoft SQL when searching a TEXT field. I have a table with two fields, Id (int) and Memo (text), populated with hundreds of thousands of rows of data. Now, imagine a query, such as:
SELECT Id FROM Table WHERE Id=1234
Pretty simple. Let's assume there is a field with Id 1234, so it returns one row.
Now, let's add one more condition to the WHERE clause.
SELECT Id FROM Table WHERE Id=1234 AND Memo LIKE '%test%'
The query should pull one record, and then check to see if the word 'test' exists in the Memo field. However, if there is enough data, this statement will hang, as if it were searching the Memo field first, and then cross referencing the results with the Id field.
While this is what it is appearing to do, I just discovered that it is actually trying to create a statistic on the Memo field. If I turn off "auto create statistics", the query runs instantly.
So my question is, how can you disable auto create statistics, but only for one query? Perhaps something like:
SET AUTO_CREATE_STATISTICS OFF
(I know, any normal person would just create a full text index on this field and call it a day. The reason I can't necessarily do this is because our data center is hosting an application for over 4,000 customers using the same database design. This problem also happens on a variety of text fields in the database, so it would take tens of thousands of full text indexes if I went that route. Not to mention, adding a full text index would add storage requirements, backup changes, disaster recovery procedure changes, red tape paperwork, etc...)

I don't think you can turn this off on a per query basis.
Best you can do would be to identify all potentially problematic columns and then CREATE STATISTICS on them yourself with 0 ROWS or 0 PERCENT specified and NORECOMPUTE.
If you have a maintenance window you can run this in, it would be best to run it without the 0 ROWS qualifier but still leave the NORECOMPUTE in place.
You could also consider enabling AUTO_UPDATE_STATISTICS_ASYNC instead, so that statistics are still rebuilt automatically but in the background rather than holding up compilation of the current query. Note that this is a database-wide option.
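A rough sketch of both suggestions (the table, column and database names are placeholders, and whether CREATE STATISTICS is even allowed depends on the column's data type, so treat this as illustrative only):
-- pre-create an empty statistic so the optimizer never triggers an auto-create,
-- and tell SQL Server never to recompute it automatically
CREATE STATISTICS st_Memo ON dbo.MyTable (Memo) WITH SAMPLE 0 ROWS, NORECOMPUTE;

-- or, database-wide, let automatic statistics updates happen in the background
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;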

Related

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the DB? Save it to a file on the application server instead.
You could either update the column in the table as you suggested, but if it was me I'd log the event to another table, i.e. id of the record, datetime, userid (maybe ip address etc, browser version etc), just about anything else I could capture and that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage pattern is etc).
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it available down the road, and it is impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure, that also issues the logging command, so that the client is not doing two roundtrips (one for the select, one for the update/insert).
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action, which wouldn't slow down the user's experience at all.
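A minimal sketch of the log-table plus stored-procedure idea (all table, column and procedure names here are made up for illustration):
CREATE TABLE dbo.RecordReadLog (
    RecordId INT      NOT NULL,
    ReadAt   DATETIME NOT NULL DEFAULT GETDATE(),
    UserId   INT      NULL
);
GO
CREATE PROCEDURE dbo.GetRecordAndLog
    @RecordId INT,
    @UserId   INT = NULL
AS
BEGIN
    -- one round trip: log the read, then return the row
    INSERT INTO dbo.RecordReadLog (RecordId, UserId) VALUES (@RecordId, @UserId);
    SELECT Id, Name, OtherColumn FROM dbo.MyTable WHERE Id = @RecordId;
END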
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for "Database Auditing for SELECT statements".
Use another table as a key/value pair with two columns(e.g. id_selected, times) for storing the ids of the records you select in your standard table, and increment the times value by 1 every time the records are selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY
UPDATE times=times+1
The ON DUPLICATE KEY syntax is off the top of my head and is MySQL-specific. For a conditional insert-or-update in MSSQL you would need to use MERGE instead.
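For example, a rough MERGE equivalent in SQL Server (2008 or later), using the same hypothetical table and column names as above:
MERGE countTable AS target
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS source
    ON target.id_selected = source.id
WHEN MATCHED THEN
    UPDATE SET times = times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (source.id, 1);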

SQL pagination for on-the-fly data

I'm new to pagination, so I'm not sure I fully understand how it works. But here's what I want to do.
Basically, I'm creating a search engine of sorts that generates results from a database (MySQL). These results are merged together algorithmically, and then returned to the user.
My question is this: When the results are merged on the backend, do I need to create a temporary view with the results that is then used by the PHP pagination? Or do I create a table? I don't want a bunch of views and/or tables floating around for each and every query. Also, if I do use temporary tables, when are they destroyed? What if the user hits the "Back" button on his/her browser?
I hope this makes sense. Please ask for clarification if you don't understand. I've provided a little bit more information below.
MORE EXPLANATION: The database contains English words and phrases, each of which is mapped to a concept (Example: "apple" is 0.67 semantically-related to the concept of "cooking"). The user can enter in a bunch of keywords, and find the closest matching concept to each of those keywords. So I am mathematically combining the raw relational scores to find a ranked list of the most semantically-related concepts for the set of words the user enters. So it's not as simple as building a SQL query like "SELECT * FROM words WHERE blah blah..."
It depends on your database engine (i.e. what kind of SQL), but nearly every SQL flavor has support for paginating a query.
For example, MySQL has LIMIT and MS SQL has ROW_NUMBER.
So you build your SQL as usual, and then you just add the database engine-specific pagination stuff and the server automatically returns only, say, row 10 to 20 of the query result.
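For example, both of the following fetch rows 11 to 20 of a result set (the table and ORDER BY column are placeholders):
-- MySQL
SELECT id, score FROM results ORDER BY score DESC LIMIT 10 OFFSET 10;

-- SQL Server 2005+
SELECT id, score
FROM (
    SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC) AS rn
    FROM results
) AS numbered
WHERE rn BETWEEN 11 AND 20;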
EDIT:
So the final query (which selects the data that is returned to the user) selects data from some tables (temporary or not), as I expected.
It's a SELECT query, which you can page with LIMIT in MySQL.
Your description sounds to me as if the actual calculation is way more resource-hogging than the final query which returns the results to the user.
So I would do the following:
get the individual results tables for the entered words, and save them in a table in a way that lets you get the data for this specific query later (for example, with an additional column like SessionID or QueryID). No pagination here.
query these result tables again for the final query that is returned to the user.
Here you can do paging by using LIMIT.
So you have to do the actual calculation (the resource-hogging queries) only once when the user "starts" the query. Then you can return paginated results to the user by just selecting from the already populated results table.
EDIT 2:
I just saw that you accepted my answer, but still, here's more detail about my usage of "temporary" tables.
Of course this is only one possible way to do it. If the expected result is not too large, returning the whole resultset to the client, keeping it in memory and doing the paging client side (as you suggested) is possible as well.
But if we are talking about real huge amounts of data of which the user will only view a few (think Google search results), and/or low bandwidth, then you only want to transfer as little data as possible to the client.
That's what I was thinking about when I wrote this answer.
So: I don't mean a "real" temporary table, I'm talking about a "normal" table used for saving temporary data.
I'm way more proficient in MS SQL than in MySQL, so I don't know much about temp tables in MySQL.
I can tell you how I would do it in MS SQL, but maybe there's a better way to do this in MySQL that I don't know.
When I have to page a resource-intensive query, I want to do the actual calculation once, save it in a table and then query that table several times from the client (to avoid doing the calculation again for each page).
The problem is: in MS SQL, a temp table only exists in the scope of the query where it is created.
So I can't use a temp table for that because it would be gone when I want to query it the second time.
So I use "real" tables for things like that.
I'm not sure whether I understood your algorithm example correctly, so I'll simplify the example a bit. I hope that I can make my point clear anyway:
This is the table (this is probably not valid MySQL, it's just to show the concept):
create table AlgorithmTempTable
(
QueryID guid,
Rank float,
Value float
)
As I said before - it's not literally a "temporary" table, it's actually a real permanent table that is just used for temporary data.
Now the user opens your application, enters his search words and presses the "Search" button.
Then you start your resource-heavy algorithm to calculate the result once, and store it in the table:
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo, bar
from Whatever
insert into AlgorithmTempTable (QueryID, Rank, Value)
select '12345678-9012-3456789', foo2, bar2
from SomewhereElse
The Guid must be known to the client. Maybe you can use the client's SessionID for that (if he has one and if he can't start more than one query at once...or you generate a new Guid on the client each time the user presses the "Search" button, or whatever).
Now all the calculation is done, and the ranked list of results is saved in the table.
Now you can query the table, filtering by the QueryID:
select Rank, Value
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
order by Rank
limit 0, 10
Because of the QueryID, multiple users can do this at the same time without interfering with each other's queries. If you create a new QueryID for each search, the same user can even run multiple queries at once.
Now there's only one thing left to do: delete the temporary data when it's not needed anymore (only the data! The table is never dropped).
So, if the user closes the query screen:
delete
from AlgorithmTempTable
where QueryID = '12345678-9012-3456789'
This is not ideal in some cases, though. If the application crashes, the data stays in the table forever.
There are several better ways. Which one is the best for you depends on your application. Some possibilities:
You can add a datetime column with the current time as default value, and then run a nightly (or weekly) job that deletes everything older than X
Same as above, but instead of a weekly job you can delete everything older than X every time someone starts a new query
If you have a session per user, you can save the SessionID in an additional column in the table. When the user logs out or the session expires, you can delete everything with that SessionID in the table
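For the first of these options, the cleanup could look roughly like this (assuming a CreatedAt column that defaults to the current time; shown in both dialects):
-- MySQL: delete result sets older than 24 hours
DELETE FROM AlgorithmTempTable WHERE CreatedAt < NOW() - INTERVAL 24 HOUR;

-- SQL Server equivalent
DELETE FROM AlgorithmTempTable WHERE CreatedAt < DATEADD(HOUR, -24, GETDATE());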
Paging results can be very tricky. The way I have done this is as follows. Set an upper-bound limit for any query that may be run, for example 5,000. If a query returns more than 5,000 rows, limit the results to 5,000.
This is best done using a stored procedure.
Store the results of the query into a temp table.
Select Page X's amount of data from the temp table.
Also return back the current page and total number of pages.
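A rough outline of such a procedure in T-SQL (the table, columns and the 5,000 cap are illustrative):
CREATE PROCEDURE dbo.SearchPaged
    @PageNumber INT,
    @PageSize   INT
AS
BEGIN
    DECLARE @TotalRows INT;

    -- cap the result set at 5,000 rows and keep it in a temp table
    SELECT TOP 5000 id, score,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn
    INTO #Results
    FROM results;

    SELECT @TotalRows = COUNT(*) FROM #Results;

    -- return page X of the capped results
    SELECT id, score
    FROM #Results
    WHERE rn BETWEEN (@PageNumber - 1) * @PageSize + 1 AND @PageNumber * @PageSize;

    -- also return the current page and total number of pages
    SELECT @PageNumber AS CurrentPage,
           CEILING(@TotalRows * 1.0 / @PageSize) AS TotalPages;
END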

Update with "not in" on huge table in SQL Server 2005

I have a table with around 115k rows. Something like this:
Table: People
Column: ID PRIMARY KEY INT IDENTITY NOT NULL
Column: SpecialCode NVARCHAR(255) NULL
Column: IsActive BIT NOT NULL
Initially, I had an index defined like so:
PK_IDX (clustered) -- clustered index on primary key
IDX_SpecialCode (non clustered, non-unique) -- index on the SpecialCode column
And I'm doing an update like so:
Update People set IsActive = 0
Where SpecialCode not in ('...enormous list of special codes....')
This enormous list is essentially 99% of the users in the table.
This update takes forever on my server. As a test I trimmed the list of special codes in the "not in" clause to something like 1% of the users in the table, and my execution plan ends up using an INDEX SCAN on the PK_IDX index instead of the IDX_SpecialCode index that I thought it'd use.
So, I thought that maybe I needed to modify the IDX_SpecialCode so that it included the column "IsActive" in it. I did so and I still see the execution plan defaulting to the PK_IDX index scan and my query still takes a very long time to run.
So - what is the more correct way to do an update of this nature? I have the list of users I want to exclude from the update, but was trying to avoid loading all employees' special codes from the database, filtering out those not in my list on the application side, and then running my query with an IN clause, which would be a much, much smaller list in my actual usage.
Thanks
If you have the employees you want to exclude, why not just populate an indexed table with those PK_IDs and do a:
Update People
set IsActive = 0
Where NOT EXISTS (SELECT NULL
                  FROM lookuptable l
                  WHERE l.PK = People.PK)
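A sketch of what populating that lookup table could look like (names are illustrative; on SQL Server 2005 the IDs would be inserted one per statement, or bulk-loaded from the application):
CREATE TABLE dbo.lookuptable (PK INT PRIMARY KEY);  -- indexed via the primary key

-- fill it with the IDs of the people to exclude from the update
INSERT INTO dbo.lookuptable (PK) VALUES (1001);
INSERT INTO dbo.lookuptable (PK) VALUES (1002);
-- ...then run the UPDATE ... WHERE NOT EXISTS above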
You are getting index scans because SQL Server is not stupid, and realizes that it makes more sense to just look at the whole table instead of checking for 100 different criteria one at a time. If your stats are up to date the optimizer knows about how much of the table is covered by your IN statement and will do a table or clustered index scan if it thinks it will be faster.
With SQL Server, indexes are ignored when you use the NOT clause. That is why you are seeing the execution plan ignoring your index. <- Ref: page 6, MCTS Exam 70-433 Database Development SQL 2008 (I'm reading it at the moment)
It might be worth taking a look at Full text indexes although I don't know whether the same will happen with that (I haven't got access to a box with it set up to test at the moment)
hth
Is there any way you could use the IDs of the users you wish to exclude instead of their code - even on indexed values comparing ids may be faster than strings.
I think that the problem is your SpecialCode NVARCHAR(255). String comparisons in SQL Server are very slow. Consider changing your query to work with the IDs. Also, try to avoid NVARCHAR: if you don't care about Unicode, use VARCHAR instead.
Also, check your database collation to see if it matches the instance collation. Make sure you are not having hard disk performance issues.

How to improve performance in SQL Server table with image fields?

I'm having a very particular performance problem at work!
In the system we're using there's a table that holds information about the current workflow process. One of the fields holds a spreadsheet that contains metadata about the process (don't ask me why!! and NO I CAN'T CHANGE IT!!)
The problem is that this spreadsheet is stored in an IMAGE field in an SQL Server 2005 (within a database set with SQL 2000 compatibility).
This table currently has 22K+ lines and even a simple query like this:
SELECT TOP 100 *
FROM OFFENDING_TABLE
Takes 30 seconds to retrieve the data in Query Analyser.
I'm thinking about updating the compatibility to SQL 2005 (once that I was informed that the app can handle it).
The second thing I'm thinking is to change the data-type of the column to varbinary(max) but I don't know if doing this will affect the application.
Another thing that I'm considering is to use sp_tableoption to set the large value types out of row to 1 as it's currently 0, but I have no information if doing this will improve performance.
Does anyone know how to improve performance in such scenario?
Edited to clarify
My problem is that I have no control on what the application asks to the SQL Server, and I did some Reflection on it (the app is a .NET 1.1 website) and it uses the offending field for some internal stuff that I have no idea what it is.
I need to improve the overall performance of this table.
I'd recommend you look into the offending table layout health:
select * from sys.dm_db_index_physical_stats(
    db_id(), object_id('offending_table'), null, null, 'detailed');
Things to look for are avg_fragmentation_in_percent, page_count, avg_page_space_used_in_percent, record_count and ghost_record_count. Cues like high fragmentation, a high number of ghost records, or a low page-used percent indicate problems, and things can be improved quite a bit just by rebuilding the index (ie. the table) from scratch:
ALTER INDEX ALL ON offending_table REBUILD;
I'm saying this considering that you cannot change the table nor the app. If you were able to change the table and the app, the advice you already got is good advice (don't use '*', don't select without a condition, use the newer varbinary(max) type, etc).
I'd also look at the Page Life Expectancy performance counter to understand whether the system is memory starved. From your description of the symptoms the system looks IO bound, which leads me to think there is little page caching going on; more RAM could help, as well as a faster IO subsystem. On a SQL 2008 system I would also suggest turning page compression on, but on 2005 you can't.
And, just to be sure, make sure the queries are not blocked by contention from the app itself, ie. that the query doesn't spend 90% of those 30 seconds waiting for a row lock. Look at sys.dm_exec_requests while the query is running and check the wait_time, wait_type and wait_resource. Is it PAGEIOLATCH_XX? Or is it a lock? Also, what does sys.dm_os_wait_stats look like on your server - what are the top wait reasons?
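For example, while the slow query is running, something along these lines:
-- what is the query currently waiting on?
SELECT session_id, status, wait_time, wait_type, wait_resource
FROM sys.dm_exec_requests
WHERE session_id > 50;   -- skip system sessions

-- top wait reasons accumulated on the server
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;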
First of all - don't ever do a SELECT * in production code - reporting or not.
You have three basic choices:
move that blob field out into a separate table if it's not always needed; probably not practical since you mention you cannot change the schema
be more careful with your SELECT statements to select only those fields that you really need - and omit the blob field
see if you can limit your query to include a WHERE clause and find a way to optimize the query plan by e.g. adding a suitable index to the table (if you can)
There's no magic "make this faster" switch - but you can optimize your query or optimize your table layout. Both help. If you can't change anything - neither the table layout, nor add an index, nor change the queries, you'll have a hard time optimizing anything, I'm afraid....
Just changing the field to VARBINARY(MAX) won't change anything at all - no performance improvement to be expected just from changing the data type.
A short answer is to only do SELECTs against multiple rows when the fields returned do not include the offending image field, ie no SELECT *. If you want the value of the image field, retrieve it on a case-by-case basis.
Setting the large value types out of row option should definitely help performance. The row size will be significantly smaller, so SQL Server can do a lot fewer physical reads to get through the table.
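The option can be flipped like this (note that it applies to varchar(max)/nvarchar(max)/varbinary(max)/xml columns, so it would matter after converting the IMAGE column to varbinary(max); the table name is taken from the question):
EXEC sp_tableoption 'dbo.OFFENDING_TABLE', 'large value types out of row', 1;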

What is a maintainable way to store large text fields without sacrificing performance?

I have been dancing around this issue for a while but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as an NVARCHAR(150), and then we get a ticket asking to expand the field size to 250, then 1000, etc, etc...
This cycle is repeated on every "note" and/or "description" field we add to most tables. Of course the concern for me is performance and breaking the 8k limit of the page. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am faced with the same 2 options that have been staring me in the face (others are welcome); please lend me your opinions.
Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields in all listings. Basically never do a SELECT * FROM [TableName] unless it is only retrieving one record.
Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
    [NoteId] [int] NOT NULL,
    [NoteText] [NVARCHAR](MAX) NOT NULL )
Obviously I would prefer to use option 1 because so much in our system would have to change if we go with 2. However, if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and I have done the homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is roughly twice that of option 2. If I select a large number of records (1000 or more), option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows the lines blur more. In a web app where page sizes of 50 or so are the norm, option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
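For instance, something along these lines against both layouts (the table names here are hypothetical stand-ins for the two proofs of concept):
SET STATISTICS IO ON;
SELECT TOP 50 Id, Title FROM dbo.MainTable_WithNotes;     -- option 1 layout
SELECT TOP 50 Id, Title FROM dbo.MainTable_WithoutNotes;  -- option 2 layout
SET STATISTICS IO OFF;
-- compare the 'logical reads' reported for each query in the Messages output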
A quick note to those suggesting the use of a TEXT/NTEXT from MSDN:
This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types.
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
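A minimal version of that transition view (assuming the main table carries a NoteId foreign key; names are illustrative):
CREATE VIEW dbo.MyTableWithNotes
AS
SELECT t.Id,              -- plus the rest of MyTable's columns
       n.NoteText
FROM dbo.MyTable AS t
LEFT JOIN dbo.Notes AS n
    ON n.NoteId = t.NoteId;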
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, it stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it will return the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on MyTable view which saves the Notes column into the Notes table (IF NULL then delete any existing Notes row, if NOT NULL then Insert if not found, otherwise Update). Perform appropriate action on MyTable_V2 table
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSTEAD OF trigger on MyTable log the fact that it was called (it can do this minimally: UPDATE a pre-existing log table row with GetDate() only if the existing row's date is > 24 hours old, so it will only do an update once a day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
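A heavily simplified sketch of steps 3 and 4 (UPDATE handling only; the column names and the 1:1 join on the ID are assumptions):
CREATE VIEW dbo.MyTable
AS
SELECT t.Id, t.Description, n.NoteText AS Notes
FROM dbo.MyTable_V2 AS t
LEFT JOIN dbo.Notes AS n ON n.NoteId = t.Id;
GO
CREATE TRIGGER dbo.trg_MyTable_Update ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    -- write the non-note columns through to the real table
    UPDATE t SET Description = i.Description
    FROM dbo.MyTable_V2 AS t
    JOIN inserted AS i ON i.Id = t.Id;

    -- NOT NULL: update an existing Notes row, or insert one if not found
    UPDATE n SET NoteText = i.Notes
    FROM dbo.Notes AS n
    JOIN inserted AS i ON i.Id = n.NoteId
    WHERE i.Notes IS NOT NULL;

    INSERT INTO dbo.Notes (NoteId, NoteText)
    SELECT i.Id, i.Notes
    FROM inserted AS i
    WHERE i.Notes IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.Notes n WHERE n.NoteId = i.Id);

    -- NULL: delete any existing Notes row
    DELETE n
    FROM dbo.Notes AS n
    JOIN inserted AS i ON i.Id = n.NoteId
    WHERE i.Notes IS NULL;
END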
Huge amount of hassle to implement, as you surmised.
Alternatively trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and not create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so confidence that I had found all instances with a simple FIND would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance went downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop though!
