Update with "not in" on huge table in SQL Server 2005 - sql-server

I have a table with around 115k rows. Something like this:
Table: People
Column: ID PRIMARY KEY INT IDENTITY NOT NULL
Column: SpecialCode NVARCHAR(255) NULL
Column: IsActive BIT NOT NULL
Initially, I had an index defined like so:
PK_IDX (clustered) -- clustered index on primary key
IDX_SpecialCode (non clustered, non-unique) -- index on the SpecialCode column
And I'm doing an update like so:
Update People set IsActive = 0
Where SpecialCode not in ('...enormous list of special codes....')
This enormous list is essentially 99% of the users in the table.
This update takes forever on my server. As a test, I trimmed the list of special codes in the "not in" clause to something like 1% of the users in the table, and my execution plan ends up using an INDEX SCAN on the PK_IDX index instead of the IDX_SpecialCode index that I thought it would use.
So I thought that maybe I needed to modify IDX_SpecialCode so that it included the IsActive column. I did so, and I still see the execution plan defaulting to the PK_IDX index scan, and my query still takes a very long time to run.
So, what is the correct way to do an update of this nature? I have the list of users I want to exclude from the update, but I was trying to avoid loading all employees' special codes from the database, filtering out those not in my list on the application side, and then running my query with an IN clause, which would be a much, much smaller list in my actual usage.
Thanks

If you have the employees you want to exclude, why not just populate an indexed table with those PK_IDs and do a:
Update People
set IsActive = 0
Where NOT EXISTS (SELECT NULL
                  FROM lookuptable l
                  WHERE l.PK = People.PK)
You are getting index scans because SQL Server is not stupid, and realizes that it makes more sense to just look at the whole table instead of checking 100 different criteria one at a time. If your stats are up to date, the optimizer knows roughly how much of the table is covered by your IN statement and will do a table or clustered index scan if it thinks that will be faster.
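A minimal sketch of that staging approach, assuming People's primary key column is named ID and using a temp table (all names and the sample rows are illustrative):
CREATE TABLE #ExcludedPeople (ID INT NOT NULL PRIMARY KEY);

-- Load the IDs of the people to exclude (shown inline here;
-- in practice, bulk-load them from your application's list).
INSERT INTO #ExcludedPeople (ID)
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3;

-- Flag everyone who is not in the exclusion list.
UPDATE p
SET p.IsActive = 0
FROM People AS p
WHERE NOT EXISTS (SELECT NULL
                  FROM #ExcludedPeople AS e
                  WHERE e.ID = p.ID);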

With SQL Server, indexes are ignored when you use the NOT clause; that is why you are seeing the execution plan ignore your index. (Ref: page 6, MCTS Exam 70-433 Database Development SQL 2008, which I'm reading at the moment.)
It might be worth taking a look at full-text indexes, although I don't know whether the same thing happens with them (I don't have access to a box with full-text search set up to test on at the moment).
hth

Is there any way you could use the IDs of the users you wish to exclude instead of their codes? Even on indexed values, comparing IDs may be faster than comparing strings.

I think the problem is your SpecialCode NVARCHAR(255) column. String comparisons in SQL Server are slow; consider changing your query to work with the IDs instead. Also, try to avoid NVARCHAR: if you don't care about Unicode, use VARCHAR instead.
Also, check your database collation to see if it matches the instance collation, and make sure you are not having hard disk performance issues.
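For reference, a quick way to compare the two collations (the database name is illustrative):
SELECT SERVERPROPERTY('Collation') AS instance_collation,
       DATABASEPROPERTYEX('YourDb', 'Collation') AS database_collation;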

Related

Oracle starts to do full table scans when a column is changed from varchar to nclob

I have a table with about 100,000 rows that used to look more or less like this:
id varchar(20),
omg varchar(10),
ponies varchar(3000)
When adding support for international characters, we had to redefine the ponies column as an NCLOB, since 3000 (multibyte) characters is too big for an NVARCHAR2:
id varchar(20),
omg varchar(10),
ponies nclob
We read from the table using a prepared statement in java:
select omg, ponies from tbl where id = ?
After the 'ponies' column was changed to an NCLOB and some other tables were changed to use nchar columns, Oracle 11g decided to do a full table scan instead of using the index for the id column, which causes our application to grind to a halt.
When adding a hint to the query, the index is used and everything is "fine", or rather just a little bit slower than it was when the column was a varchar.
We have defined the following connection properties:
oracle.jdbc.convertNcharLiterals="true"
defaultNChar=true
Btw, the database statistics are up to date.
I have not had time to look at all queries, so I don't know if other indexes are ignored, but do I have to worry that the defaultNChar setting is somehow confusing the optimizer, since id is not an nchar? It would be rather awkward to either sprinkle hints on virtually all queries or redefine all keys.
Alternatively, is the full table scan regarded as insignificant because a "large" NCLOB is going to be loaded anyway? That assumption seems to be off by 3 orders of magnitude, and I would like to believe that Oracle is smarter than that.
Or is it just bad luck? Or, something else? Is it possible to fix without hints?
The problem turns out to be the jdbc-flag defaultNChar=true.
Oracle's optimizer will not use indexes created on char/varchar2 columns if the parameter is sent as an nchar/nvarchar. This almost makes sense, as I suppose you could otherwise get phantom results.
We are mostly using stored procedures, with the parameters defined as char/varchar2 - forcing a conversion before the query is executed - so we didn't notice this effect except in a few places where dynamic sql is used.
The solution is to convert the database to AL32UTF8 and get rid of the nchar columns.
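For what it's worth, here is a rough sketch of the mechanism and of a stop-gap workaround where conversion isn't possible yet. SYS_OP_C2C is an internal Oracle function and the names are illustrative, so treat this as an assumption to verify on your own system:
-- With defaultNChar=true the bind arrives as an NVARCHAR, so Oracle
-- implicitly converts the VARCHAR2 column, effectively evaluating:
--   WHERE SYS_OP_C2C(id) = :id   -- plain index on id is unusable
-- A function-based index matching the converted form restores index access:
CREATE INDEX tbl_id_c2c_ix ON tbl (SYS_OP_C2C(id));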
When you redid the statistics, did you estimate, or did you use dbms_stats.gather_table_stats with an estimate_percent > 50%? If you didn't, then use dbms_stats with a 100% estimate_percent.
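For example (the table name is illustrative):
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TBL', estimate_percent => 100);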
If your table is only 3 columns and these are the ones you're returning, then the best index is all 3 columns, no matter what you hint and even if the id index is unique. As it stands, your explain plan should be a unique index scan followed by a table access by rowid. If you index all 3 columns this becomes just the unique scan, as all the information you're returning is already in the index and there's no need to re-access the table to get it. The order would be id, omg, ponies to make use of it in the WHERE clause. This would effectively make your table an index-organized table, which would be easier than having a separate index. Obviously, gather stats afterwards.
Having said all that, I'm not actually certain you can index an NCLOB, and no matter what you do the size of the column will have an impact: the longer it is, the more disk reads you will have to do.
Sorry, but I don't understand why you had to change your ponies column from VARCHAR to a CLOB. If the maximum length in this column is 3000 characters, why not use an NVARCHAR2 column instead? As far as I know, NVARCHAR2 can hold up to 4000 characters.
But you're right, the maximum column size allowed is 2000 characters when the national character set is AL16UTF16 and 4000 when it is UTF8.

Oracle 11g: Index not used in "select distinct"-query

My question concerns Oracle 11g and the use of indexes in SQL queries.
In my database, there is a table that is structured as followed:
Table tab (
    rowid NUMBER(11),
    unique_id_string VARCHAR2(2000),
    year NUMBER(4),
    dynamic_col_1 NUMBER(11),
    dynamic_col_1_text NVARCHAR2(2000)
) TABLESPACE tabspace_data;
I have created two indexes:
CREATE INDEX Index_dyn_col1 ON tab (dynamic_col_1, dynamic_col_1_text) TABLESPACE tabspace_index;
CREATE INDEX Index_unique_id_year ON tab (unique_id_string, year) TABLESPACE tabspace_index;
The table contains around 1 to 2 million records. I extract the data from it by executing the following SQL command:
SELECT DISTINCT
    "sub_select"."dynamic_col_1" "AS_dynamic_col_1",
    "sub_select"."dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM
(
    SELECT "tab".* FROM "tab"
    WHERE "tab".year = 2011
) "sub_select"
Unfortunately, the query needs around 1 hour to execute, although I created both indexes described above.
The explain plan shows that Oracle uses a "Table Full Access", i.e. a full table scan. Why is the index not used?
As an experiment, I tested the following SQL command:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
Even in this case, the index is not used and a full table scan is performed.
In my real database, the table contains more indexed columns like "dynamic_col_1" and "dynamic_col_1_text".
The whole index file has a size of about 50 GB.
A few more details:
The database is Oracle 11g installed on my local computer.
I use Windows 7 Enterprise 64bit.
The whole index is split over 3 dbf files totalling about 50 GB.
I would really be glad, if someone could tell me how to make Oracle use the index in the first query.
Because the first query is used by another program to extract the data from the database, it can hardly be changed. So it would be good to tweak the table instead.
Thanks in advance.
[01.10.2011: UPDATE]
I think I've found the solution to the problem. Both columns, dynamic_col_1 and dynamic_col_1_text, are nullable. After altering the table to prohibit NULL values in both columns and adding a new index solely for the column year, Oracle performs a Fast Index Scan.
The advantage is that the query now takes about 5 seconds to execute instead of 1 hour as before.
Are you sure that an index access would be faster than a full table scan? As a very rough estimate, full table scans are 20 times faster than reading an index. If tab has more than 5% of its data in 2011, it's not surprising that Oracle would use a full table scan. And as @Dan and @Ollie mentioned, having year as the second column makes the index even slower for this query.
If the index really is faster, then the issue is probably bad statistics. There are hundreds of ways the statistics could be bad. Very briefly, here's what I'd look at first:
Run an explain plan with and without an index hint. Are the cardinalities off by 10x or more? Are the times off by 10x or more?
If the cardinality is off, make sure there are up-to-date stats on the table and index, and that you're using a reasonable ESTIMATE_PERCENT (DBMS_STATS.AUTO_SAMPLE_SIZE is almost always best for 11g).
If the time is off, check your workload statistics.
Are you using parallelism? Oracle always assumes a near linear improvement for parallelism, but on a desktop with one hard drive you probably won't see any improvement at all.
Also, this isn't really relevant to your problem, but you may want to avoid using quoted identifiers. Once you use them you have to use them everywhere, and it generally makes your tables and queries painful to work with.
Your index should be:
CREATE INDEX Index_year
ON tab (year)
TABLESPACE tabspace_index;
Also, your query could just be:
SELECT DISTINCT
dynamic_col_1 "AS_dynamic_col_1",
dynamic_col_1_text "AS_dynamic_col_1_text"
FROM tab
WHERE year = 2011;
If your index was created solely for this query, though, you could include the two fetched columns in it as well; then the optimiser would not have to go to the table for the query data at all, since it could retrieve everything directly from the index, making your query more efficient again.
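For instance, a sketch of such a covering index (the index name is illustrative, and this assumes the combined key stays within your block size's index key limit):
CREATE INDEX Index_year_cols
ON tab (year, dynamic_col_1, dynamic_col_1_text)
TABLESPACE tabspace_index;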
Hope it helps...
I don't have an Oracle instance on hand so this is somewhat guesswork, but my inclination is to say it's because you have the compound index in the wrong order. If you had year as the first column in the index it might use it.
Your second test query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
would not use the index because you have no WHERE clause, so you're asking Oracle to read every row in the table. In that situation the full table scan is the faster access method.
Also, as other posters have mentioned, your index on YEAR has it in the second column. Oracle can use this index by performing a skip scan, but there is a performance hit for doing so, and depending on the size of your table Oracle may just decide to use the FTS again.
I don't know if it's relevant, but I tested the following query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
WHERE "dynamic_col_1" = 123 AND "dynamic_col_1_text" = 'abc'
The explain plan for that query shows that Oracle uses an index scan in this scenario.
The columns dynamic_col_1 and dynamic_col_1_text are nullable. Does this have an effect on the usage of the index?
Try this:
1) Create an index on the year field (see Ollie's answer).
2) And then use this query:
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID IN (SELECT ID FROM tab WHERE year = 2011)
or
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID IN (SELECT ID FROM tab WHERE year = 2011)
GROUP BY dynamic_col_1, dynamic_col_1_text
Maybe it will help you.

MSSQL/Oracle Query Tuning 500,000+ records Coldfusion - does lower() reduce performance

I'm not trying to start a debate on which is better in general, I'm asking specifically to this question. :)
I need to write a query to pull back a list of userid (uid) from a database containing 500k+ records. I'm returning just the one field, uid. I can query either our Oracle box or our MSSQL 2000 box. The query looks like this (it has not been simplified):
select uid
from employeeRec
where uid = 'abc123'
Yes, it really is that simple a query. Where I need the tuning help is that uid is indexed, and some uids (not many, but some) could be 'ABC123' rather than 'abc123'. MSSQL doesn't care about case sensitivity, whereas Oracle does. So for Oracle, my query would look like this:
select uid
from employeeRec
where lower(uid) = 'abc123'
I've learned that if you use lower() on an indexed field in MSSQL, you render the index useless (there are ways around it, but that is beyond the scope of my question here, since if I choose MSSQL I don't need to use lower() at all). I wanted to know: if I choose Oracle and use the lower() function, will that also hurt the performance of the query?
I'm looping over this query about 200 times, in addition to some other queries that are being run, and the entire loop takes about 1 second per iteration; I've narrowed the slowness down to this particular query. For a web page, 200 seconds seems like an eternity. For you CF readers: the timeout value has been increased so the page doesn't error out, and there are no page errors; I'm just trying to speed up this query.
Another item to note: This database is in a different city than the other queries being run so I do expect some lag time there.
As TomTom put it, your index will simply not be used by Oracle. But you can create a function-based index, and this new index will be used when you issue your query.
create index my_new_ix on employeeRec(lower(uid));
Wrapping an indexed column in a function call would have the potential to cause performance problems in Oracle. Oracle couldn't use a plain index on UID to process your query. On the other hand, you could create a function-based index on lower(uid) that would be used by the query, i.e.
CREATE INDEX case_insensitive_idx
ON employeeRec( lower( uid ) );
Note that if you want to do case-insensitive queries in general, you may be better served setting NLS parameters to force case-insensitivity. You'd still need function-based indexes on the columns you're searching on, but it can simplify your queries a bit.
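A sketch of that NLS approach (the index name is illustrative):
-- Make comparisons in this session case-insensitive...
ALTER SESSION SET NLS_COMP = LINGUISTIC;
ALTER SESSION SET NLS_SORT = BINARY_CI;

-- ...and give the optimizer a matching linguistic index to use.
CREATE INDEX employeeRec_uid_ci_ix
ON employeeRec (NLSSORT(uid, 'NLS_SORT=BINARY_CI'));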
I wanted to know if I choose Oracle, and use the lower() function, will that also hurt performance of the query?
Yes. The performance reduction happens because the index is on the original values and the collation is case-sensitive, so all possible values must be run through the function to filter out the ones that match.

Why is doing a top(1) on an indexed column in SQL Server slow?

I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indexes) there is an index on one column, campaignid_int.
Now I have 700k rows where the campaignid is indeed 3835
For all these rows, the connectionid is the same.
I just want to find out this connectionid.
use messaging_db;
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
Now this query takes approx 30 seconds to perform!
I (with my limited DB knowledge) would expect that it would take any of the rows and return me that connectionid.
If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works.
How would I tackle this and why does this not work?
edit:
estimated execution plan:
select (0%) - top (0%) - clustered index scan (100%)
Due to the statistics, you should explicitly ask the optimizer to use the index you've created instead of the clustered one.
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK, index(idx_connectionid))
WHERE (campaignid_int = 3835)
I hope it will solve the issue.
Regards,
Enrique
I recently had the same issue, and it's really quite simple to solve (at least in some cases).
If you add an ORDER BY clause on one or more of the indexed columns, it should be solved. That solved it for me, at least.
You aren't specifying an ORDER BY clause in your query, so the optimiser is not being instructed as to the sort order it should select the top 1 from. SQL Server won't just take a random row; it will order the rows by something and take the top 1, and it may be choosing to order by something that is sub-optimal. I would suggest that you add an ORDER BY x clause, where x is the clustered key on that table; that will probably be the fastest.
This may not solve your problem -- in fact I'm not sure I expect it to from the statistics you've given -- but (a) it won't hurt, and (b) you'll be able to rule this out as a contributing factor.
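For example, assuming the clustered key is a column named id (a hypothetical name):
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE campaignid_int = 3835
ORDER BY id;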
If the campaignid_int column is not indexed, add an index to it. That should speed up the query. Right now I presume that you need to do a full table scan to find the matches for campaignid_int = 3835 before the top(1) row is returned (filtering occurs before results are returned).
EDIT: An index is already in place, but since SQL Server does a clustered index scan, the optimizer has ignored the index. This is probably due to the (many) duplicate rows with the same campaignid_int value. You should consider indexing differently or querying on a different column to get the connectionid you want.
The index may be useless for 2 reasons:
700k rows out of 10 million may not be selective enough
and/or
connectionid needs to be included so the entire query can use only the index
Otherwise, the optimiser decides it may as well use the PK/clustered index to both filter on campaignid_int and get connectionid, to avoid a bookmark lookup on 700k rows from the current index.
So, I suggest this...
CREATE NONCLUSTERED INDEX IX_Foo ON MyTable (campaignid_int) INCLUDE (connectionid)
This doesn't answer your question, but try using:
SET ROWCOUNT 1
SELECT connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
I've seen top(x) perform very badly in certain situations as well. I'm sure it's doing a full table scan. Perhaps your index on that particular column needs to be rebuilt? The above is worth a try, however.
Your query does not work as you expect because SQL Server keeps statistics about your index and, in this particular case, knows that there are a lot of duplicate rows with the identifier 3835, hence it figures that it would make more sense to just do a full index (or table) scan. When you test with an ID which resolves to only one row, it uses the index as expected, i.e. performs an index seek (the execution plan should verify this guess).
Possible solutions? Make the index composite, if you have anything to compose it with; that is, e.g. compose it with the date the message was sent (if I understand your case correctly) and then select the top 1 entry from the list with the specified id, ordered by the date. Though I'm not sure whether this would be better (for one, a composite index takes up more space); just a guess.
EDIT: I just tried out the suggestion of making the index composite by adding a date column. If you do that and specify order by date in your query, an index seek is performed as expected.
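A sketch of that composite approach (sent_date is a hypothetical column name):
CREATE NONCLUSTERED INDEX IX_campaign_date
ON outgoing_messages (campaignid_int, sent_date);

SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE campaignid_int = 3835
ORDER BY sent_date;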
but since I'm specifying 'top(1)' it means: give me any row. Why would it first crawl through the 700k rows just to return one? – reinier 30 mins ago
Sorry, I can't comment yet, but the answer here is that SQL Server is not going to understand the human equivalent of "bring me the first one you find" when it hears "TOP 1". Instead of the expected "give me any row", SQL Server goes and fetches the first of all the found rows.
The only time it knows which row that is, is after first fetching all the matching rows and then discarding the rest. Very thorough, but in your case not really fast.
The main issue, as others said, is your statistics and the selectivity of your index. If you have another unique field in your table (like an identity column), then try a combined index on campaignid_int first, unique column second. As you only query on campaignid_int, it has to be the first part of the key.
Sounds worth a try, as this index should have a higher selectivity, so the optimizer can use it better than doing an index crawl.

SQL Server STATISTICS

So for this one project, we have a bunch of queries that are executed on a regular basis (every minute or so). I used the "Analyze Query in Database Engine" option to check on them.
They are pretty simple:
select * from tablex where processed='0'
There is an index on processed, and each query should return <1000 rows on a table with 1MM records.
The Analyzer recommended creating some STATISTICS on this... So my question is: what are those statistics? Do they really help performance? How costly are they for a table like the one above?
Please bear in mind that by no means would I call myself an experienced SQL Server user... and this is the first time I've used this Analyzer.
Statistics are what SQL Server uses to determine the viability of how to get data.
Let's say, for instance, that you have a table that only has a clustered index on the primary key. When you execute SELECT * FROM tablename WHERE col1=value, SQL Server only has one option, to scan every row in the table to find the matching rows.
Now we add an index on col1, so you assume that SQL Server will use the index to find the matching rows, but that's not always true. Let's say that the table has 200,000 rows and col1 only has 2 values: 1 and 0. When SQL Server uses an index to find data, the index contains pointers back to the clustered index position. Given there are only two values in the indexed column, SQL Server decides it makes more sense to just scan the table, because using the index would be more work.
Now we'll add another 800,000 rows of data to the table, but this time the values in col1 are widely varied. Now it's a useful index because SQL Server can viably use the index to limit what it needs to pull out of the table. Will SQL Server use the index?
It depends. And what it depends on is the statistics. At some point in time, with AUTO UPDATE STATISTICS set on, the server will update the statistics for the index and know it's a very good and valid index to use. Until that point, however, it will ignore the index as irrelevant.
That's one use of statistics. But there is another use, and that one isn't related to indexes. SQL Server keeps basic statistics about all of the columns in a table. If there's enough distinct data to make it worthwhile, SQL Server will actually create a temporary index on a column and use that to filter. While this takes more time than using an existing index, it takes less time than a full table scan.
Sometimes you will get recommendations to create specific statistics on columns that would be useful for that. These aren't indexes, but they do keep track of a statistical sampling of the data in the column, so SQL Server can determine whether it makes sense to create a temporary index to return data.
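As a concrete sketch of what such a recommendation amounts to (the statistics name is illustrative):
-- Create column statistics by hand, then inspect the sampled distribution.
CREATE STATISTICS st_processed ON tablex (processed);
DBCC SHOW_STATISTICS ('tablex', st_processed);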
HTH
In SQL Server 2005, set auto create statistics and auto update statistics on. You won't have to worry about creating or maintaining the statistics yourself, since the database handles this very well by itself.
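The corresponding settings, sketched for a database named YourDb (the name is illustrative):
ALTER DATABASE YourDb SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE YourDb SET AUTO_UPDATE_STATISTICS ON;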
