Oracle 11g: Index not used in "select distinct"-query - database

My question concerns Oracle 11g and the use of indexes in SQL queries.
In my database, there is a table that is structured as followed:
Table tab (
rowid NUMBER(11),
unique_id_string VARCHAR2(2000),
year NUMBER(4),
dynamic_col_1 NUMBER(11),
dynamic_col_1_text NVARCHAR2(2000)
) TABLESPACE tabspace_data;
I have created two indexes:
CREATE INDEX Index_dyn_col1 ON tab (dynamic_col_1, dynamic_col_1_text) TABLESPACE tabspace_index;
CREATE INDEX Index_unique_id_year ON tab (unique_id_string, year) TABLESPACE tabspace_index;
The table contains around 1 to 2 million records. I extract the data from it by executing the following SQL command:
SELECT distinct
"sub_select"."dynamic_col_1" "AS_dynamic_col_1","sub_select"."dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM
(
SELECT "tab".* FROM "tab"
where "tab".year = 2011
) "sub_select"
Unfortunately, the query needs around 1 hour to execute, although I created the both indexes described above.
The explain plan shows that Oracle uses a "Table Full Access", i.e. a full table scan. Why is the index not used?
As an experiment, I tested the following SQL command:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
Even in this case, the index is not used and a full table scan is performed.
In my real database, the table contains more indexed columns like "dynamic_col_1" and "dynamic_col_1_text".
The whole index file has a size of about 50 GB.
A few more informations:
The database is Oracle 11g installed on my local computer.
I use Windows 7 Enterprise 64bit.
The whole index is split over 3 dbf files with about 50GB size.
I would really be glad, if someone could tell me how to make Oracle use the index in the first query.
Because the first query is used by another program to extract the data from the database, it can hardly be changed. So it would be good to tweak the table instead.
Thanks in advance.
[01.10.2011: UPDATE]
I think I've found the solution for the problem. Both columns dynamic_col_1 and dynamic_col_1_text are nullable. After altering the table to prohibit "NULL"-values in both columns and adding a new index solely for the column year, Oracle performs a Fast Index Scan.
The advantage is that the query takes now about 5 seconds to execute and not 1 hour as before.

Are you sure that an index access would be faster than a full table scan? As a very rough estimate, full table scans are 20 times faster than reading an index. If tab has more than 5% of the data in 2011 it's not surprising that Oracle would use a full table scan. And as #Dan and #Ollie mentioned, with year as the second column this will make the index even slower.
If the index really is faster, than the issue is probably bad statistics. There are hundreds of ways the statistics could be bad. Very briefly, here's what I'd look at first:
Run an explain plan with and without and index hint. Are the cardinalities off by 10x or more? Are the times off by 10x or more?
If the cardinality is off, make sure there are up to date stats on the table and index and you're using a reasonable ESTIMATE_PERCENT (DBMS_STATS.AUTO_SAMPLE_SIZE is almost always the best for 11g).
If the time is off, check your workload statistics.
Are you using parallelism? Oracle always assumes a near linear improvement for parallelism, but on a desktop with one hard drive you probably won't see any improvement at all.
Also, this isn't really relevant to your problem, but you may want to avoid using quoted identifiers. Once you use them you have to use them everywhere, and it generally makes your tables and queries painful to work with.

Your index should be:
CREATE INDEX Index_year
ON tab (year)
TABLESPACE tabspace_index;
Also, your query could just be:
SELECT DISTINCT
dynamic_col_1 "AS_dynamic_col_1",
dynamic_col_1_text "AS_dynamic_col_1_text"
FROM tab
WHERE year = 2011;
If your index was created solely for this query though, you could create it including the two fetched columns as well, then the optimiser would not have to go to the table for the query data, it could retrieve it directly from the index making your query more efficient again.
Hope it helps...

I don't have an Oracle instance on hand so this is somewhat guesswork, but my inclination is to say it's because you have the compound index in the wrong order. If you had year as the first column in the index it might use it.

Your second test query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
would not use the index because you have no WHERE clause, so you're asking Oracle to read every row in the table. In that situation the full table scan is the faster access method.
Also, as other posters have mentioned, your index on YEAR has it in the second column. Oracle can use this index by performing a skip scan, but there is a performance hit for doing so, and depending on the size of your table Oracle may just decide to use the FTS again.

I don't know if it's relevant, but I tested the following query:
SELECT DISTINCT
"dynamic_col_1" "AS_dynamic_col_1", "dynamic_col_1_text" "AS_dynamic_col_1_text"
FROM "tab"
WHERE "dynamic_col_1" = 123 AND "dynamic_col_1_text" = 'abc'
The explain plan for that query show that Oracle uses an index scan in this scenario.
The columns dynamic_col_1 and dynamic_col_1_text are nullable. Does this have an effect on the usage of the index?
01.10.2011: UPDATE]
I think I've found the solution for the problem. Both columns dynamic_col_1 and dynamic_col_1_text are nullable. After altering the table to prohibit "NULL"-values in both columns and adding a new index solely for the column year, Oracle performs a Fast Index Scan. The advantage is that the query takes now about 5 seconds to execute and not 1 hour as before.

Try this:
1) Create an index on year field (see Ollie answer).
2) And then use this query:
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID (SELECT ID FROM tab WHERE year=2011)
or
SELECT DISTINCT
dynamic_col_1
,dynamic_col_1_text
FROM tab
WHERE ID (SELECT ID FROM tab WHERE year=2011)
GROUP BY dynamic_col_1, dynamic_col_1_text
Maybe it will help you.

Related

How to store 300M records in Postgresql to run efficiency queries

I have following table:
CREATE TABLE public.shop_prices
(
shop_name text COLLATE pg_catalog."default",
product text COLLATE pg_catalog."default",
product_category text COLLATE pg_catalog."default",
price text COLLATE pg_catalog."default"
)
and for this table i have a dataset from 18 months. In each file there are about 15M records. I have to some analysis, like in which month a shop has increased or decreased their price. I imported two months in a table and run following query just to test:
select shop, product from shop_prices group by shop, product limit 10
I waited more than 5 minutes, but no any result and response. It was still on working. What is the best way the store these datasets and run efficiency queries? Is it a good idea if I create for each dataset a seperate tables?
Using explain analyze select shop_name, product from shop_prices group by shop, product limit 10 you can see how Postgres is planning and executing the query and the time the execution takes. You'll see it needs to read the whole table (with the time consuming disk reads) and then sort it in memory - which will probably need to be cached on disk, before returning the results. In the next run you might discover the same query is very snappy if the number of shop_name+product combinations are very limited and thus stored in pg_stats after that explain analyze. The point being that a simple query like this can be deceiving.
You will faster execution by creating an index on the columns you are using (create index shop_prices_shop_prod_idx on public.shop_prices(shop_name,product)).
You should definitely change the price column type to numeric (or float/float8)) if you plan to do any numerical calculations on it.
Having said all that, I suspect this table is not what you will be using as it does not have any timestamp to compare prices between months to begin with.
I suggest you complete the table design and speculate on indices to improve performance. You might even want consider table partitioning https://www.postgresql.org/docs/current/ddl-partitioning.html
You will probably be doing all sorts of queries on this data so there is no simple solution to them all.
By all means return with perhaps more specific questions with complete table description and the output from the explain analyze statement for queries you are trying out and get some good advice.
Best regards,
Bjarni
What is your PostgreSQL version ?
First there is a typo: column shop should be shop_name.
Second you query looks strange because it has only a LIMIT clause without any ORDER BY clause or WHERE clause: do you really want to have "random" rows for this query ?
Can you try to post EXPLAIN output for the SQL statement:
explain select shop_name, product from shop_prices group by shop_name, product limit 10;
Can you also check if any statistics have been computed for this table with:
select * from pg_stats where tablename='shop_prices';

KeyLookup in massive columns in sql server

I have one simple query which has multiple columns (more than 1000).
When i run with single column it gives me result in 2 seconds with proper index seek, logical read, cpu and every thing is under thresholds.
But when i select more than 1000 columns it takes 11 mins for the result and gives me key lookup.
You folks have you faced this type of issue?
Any suggestion on that issue?
Normally, I would suggest to add those columns in the INCLUDE fields of your non-clustered index. Adding them in the INCLUDE removes the LOOKUP in the execution plan. But as everything with SQL Server, it depends. Depending on how the table is used i.e, if you're updating the table more than just plain SELECTing on it, then the LOOKUP might be ok.
If this query is run once per year, the overhead of additional index is probably not worth it. If you need quick response time, that single time of the year when it needs to be run, look into 'pre executing' it and just present the result to the user.
The difference in your query plan might be because of join elimination (if your query contains JOINs with multiple tables) or just that the additional columns you are requesting do not exist in your currently existing indexes...

Update with "not in" on huge table in SQL Server 2005

I have a table with around 115k rows. Something like this:
Table: People
Column: ID PRIMARY KEY INT IDENTITY NOT NULL
Column: SpecialCode NVARCHAR(255) NULL
Column: IsActive BIT NOT NULL
Initially, I had an index defined like so:
PK_IDX (clustered) -- clustered index on primary key
IDX_SpecialCode (non clustered, non-unique) -- index on the SpecialCode column
And I'm doing an update like so:
Update People set IsActive = 0
Where SpecialCode not in ('...enormous list of special codes....')
This enormous list is essentially 99% of the users in the table.
This update takes forever on my server. As a test I trimmed the list of special codes in the "not in" clause to something like 1% of the users in the table, and my execution plan ends up using an INDEX SCAN on the PK_IDX index instead of the IDX_SpecialCode index that I thought it'd use.
So, I thought that maybe I needed to modify the IDX_SpecialCode so that it included the column "IsActive" in it. I did so and I still see the execution plan defaulting to the PK_IDX index scan and my query still takes a very long time to run.
So - what is the more correct way to do an update of this nature? I have the list of user's I want to exclude from the update, but was trying to avoid loading all employees special codes from the database, filtering out those not in my list on my application side, and then running my query with an in clause, which will be a much much smaller list in my actual usage.
Thanks
If you have the employees you want to exclude, why not just populate an indexed table with those PK_IDs and do a:
Update People
set IsActive = 0
Where NOT EXISTS (SELECT NULL
FROM lookuptable l
WHERE l.PK = People.PK)
You are getting index scans because SQL Server is not stupid, and realizes that it makes more sense to just look at the whole table instead of checking for 100 different criteria one at a time. If your stats are up to date the optimizer knows about how much of the table is covered by your IN statement and will do a table or clustered index scan if it thinks it will be faster.
With SQL-Server indexes are ignored when you use the NOT clause. That is why you are seeing the execution plan ignoring your index. <- Ref: page 6. MCTS Exam 70-433 Database Development SQL 2008 (I'm reading it at the moment)
It might be worth taking a look at Full text indexes although I don't know whether the same will happen with that (I haven't got access to a box with it set up to test at the moment)
hth
Is there any way you could use the IDs of the users you wish to exclude instead of their code - even on indexed values comparing ids may be faster than strings.
I think that the problem is your SpecialCode NVARCHAR(255). Strings comparison in Sql Server are very slow. Consider change your query to work with the IDs. And also, try to avoid the NVarchar. if dont care about Unicode, use Varchar instead.
Also, check your database collation to see if it matches the instance collation. Make sure you are not having hard disk performance issues.

Why is doing a top(1) on an indexed column in SQL Server slow?

I'm puzzled by the following. I have a DB with around 10 million rows, and (among other indices) on 1 column (campaignid_int) is an index.
Now I have 700k rows where the campaignid is indeed 3835
For all these rows, the connectionid is the same.
I just want to find out this connectionid.
use messaging_db;
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
Now this query takes approx 30 seconds to perform!
I (with my small db knowledge) would expect that it would take any of the rows, and return me that connectionid
If I test this same query for a campaign which only has 1 entry, it goes really fast. So the index works.
How would I tackle this and why does this not work?
edit:
estimated execution plan:
select (0%) - top (0%) - clustered index scan (100%)
Due to the statistics, you should explicitly ask the optimizer to use the index you've created instead of the clustered one.
SELECT TOP (1) connectionid
FROM outgoing_messages WITH (NOLOCK, index(idx_connectionid))
WHERE (campaignid_int = 3835)
I hope it will solve the issue.
Regards,
Enrique
I recently had the same issue and it's really quite simple to solve (at least in some cases).
If you add an ORDER BY-clause on any or some of the columns that's indexed it should be solved. That solved it for me at least.
You aren't specifying an ORDER BY clause in your query, so the optimiser is not being instructed as to the sort order it should be selecting the top 1 from. SQL Server won't just take a random row, it will order the rows by something and take the top 1, and it may be choosing to order by something that is sub-optimal. I would suggest that you add an ORDER BY x clause, where x being the clustered key on that table will probably be the fastest.
This may not solve your problem -- in fact I'm not sure I expect it to from the statistics you've given -- but (a) it won't hurt, and (b) you'll be able to rule this out as a contributing factor.
If the campaignid_int column is not indexed, add an index to it. That should speed up the query. Right now I presume that you need to do a full table scan to find the matches for campaignid_int = 3835 before the top(1) row is returned (filtering occurs before results are returned).
EDIT: An index is already in place, but since SQL Server does a clustered index scan, the optimizer has ignored the index. This is probably due to (many) duplicate rows with the same campaignid_int value. You should consider indexing differently or query on a different column to get the connectionid you want.
The index may be useless for 2 reasons:
700k in 10 million may be not selective enough
and /or
connectionid needs included so the entire query can used only an index
Otherwise, the optimiser decides it may as well use the PK/clustered index to both filter on campaignid_int and get connectionid, to avoid a bookmark lookup on 700k rows from the current index.
So, I suggest this...
CREATE NONCLUSTERED INDEX IX_Foo ON MyTable (campaignid_int) INCLUDE (connectionid)
This doesn't answer your question, but try using:
SET ROWCOUNT 1
SELECT connectionid
FROM outgoing_messages WITH (NOLOCK)
WHERE (campaignid_int = 3835)
I've seen top(x) perform very badly in certain situations as well. I'm sure it's doing a full table scan. Perhaps your index on that particular column needs to be rebuilt? The above is worth a try, however.
Your query does not work as you expect, because Sql Server keeps statistics about your index and in this particular case knows that there are a lot of duplicate rows with the identifier 3835, hence it figures that it would make more sense to just do a full index (or table) scan. When you test for an ID which resolves to only one row, it uses the index as expected, i.e. performs an index seek (the execution plan should verify this guess).
Possible solutions ? Make the index composite, if you have anything to compose it with, that is, e.g. compose it with the date the message was sent (if I understand your case correctly) and then select the top 1 entry from the list with the specified id ordered by the date. Though I'm not sure whether this would be better (for one, a composite index takes up more space) - just a guess.
EDIT: I just tried out the suggestion of making the index composite by adding a date column. If you do that and specify order by date in your query, an index seek is performed as expected.
but since I'm specifying 'top(1)' it
means: give me any row. Why would it
first crawl through the 700k rows just
to return one? – reinier 30 mins ago
Sorry, can't comment yet but the answer here is that SQL server is not going to understand the human equivalent of "Bring me the first one you find" when it hears "Top 1". Instead of the expected "Give me any row" SQL Server goes and fetches the first of all found rows.
Only time it knows that is after fetching all rows first, then discarding the rest. Very thorough but in your case not really fast.
Main issue as other said are your statistics and selectivity of your index. If you have another unique field in your table (like an identity column) then try an combined index on campaignid_int first, unique column second. As you only query on campaignid_int it has to be the first part of the key.
Sounds worth a try as this index should have a higher selectivity thus the optimizer can use this better than doing an index crawl.

SQL Server STATISTICS

So for this one project, we have a bunch of queries that are executed on a regular basis (every minute or so. I used the "Analyze Query in Database Engine " to check on them.
They are pretty simple:
select * from tablex where processed='0'
There is an index on processed, and each query should return <1000 rows on a table with 1MM records.
The Analyzer recommended creating some STATISTICS on this.... So my question is: What are those statistics ? do they really help performance ? how costly are they for a table like above ?
Please bear in mind that by no means I would call myself a SQL Server experienced user ... And this is the first time using this Analyzer.
Statistics are what SQL Server uses to determine the viability of how to get data.
Let's say, for instance, that you have a table that only has a clustered index on the primary key. When you execute SELECT * FROM tablename WHERE col1=value, SQL Server only has one option, to scan every row in the table to find the matching rows.
Now we add an index on col1 so you assume that SQL Server will use the index to find the matching rows, but that's not always true. Let's say that the table has 200,000 rows and col1 only has 2 values: 1 and 0. When SQL Server uses an index to find data, the index contains pointers back to the clustered index position. Given there's only two values in the indexed column, SQL Server decides it makes more sense to just scan the table because using the index would be more work.
Now we'll add another 800,000 rows of data to the table, but this time the values in col1 are widely varied. Now it's a useful index because SQL Server can viably use the index to limit what it needs to pull out of the table. Will SQL Server use the index?
It depends. And what it depends on are the Statistics. At some point in time, with AUTO UPDATE STATISTICS set on, the server will update the statistics for the index and know it's a very good and valid index to use. Until that point, however, it will ignore the index as being irrelevant.
That's one use of statistics. But there is another use and that isn't related to indices. SQL Server keeps basic statistics about all of the columns in a table. If there's enough different data to make it worthwhile, SQL Server will actually create a temporary index on a column and use that to filter. While this takes more time than using an existing index, it takes less time than a full table scan.
Sometimes you will get recommendations to create specific statistics on columns that would be useful for that. These aren't indices, but the do keep track of the statistical sampling of data in the column so SQL Server can determine whether it makes sense to create a temporary index to return data.
HTH
In Sql Server 2005, set auto create statistics and auto update statistics. You won't have to worry about creating them or maintaining them yourself, since the database handles this very well itself.

Resources