Why is MySQL ignoring the 'obvious' key to use in this simple join query? - django-models

I have what I'd thought would be a simple query, but it takes 'forever'. I'm not great with SQL optimizations, so I thought I could ask you guys.
Here's the query, with EXPLAIN:
EXPLAIN SELECT *
FROM `firms_firmphonenumber`
INNER JOIN `firms_location` ON (
`firms_firmphonenumber`.`location_id` = `firms_location`.`id`
)
ORDER BY
`firms_location`.`name_en` ASC,
`firms_firmphonenumber`.`location_id` ASC LIMIT 100;
Result:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, 'SIMPLE', 'firms_location', 'ALL', 'PRIMARY', '', '', '', 73030, 'Using temporary; Using filesort'
1, 'SIMPLE', 'firms_firmphonenumber', 'ref', 'firms_firmphonenumber_firm_id', 'firms_firmphonenumber_firm_id', '4', 'citiadmin.firms_location.id', 1, ''
Keys on firms_location:
Keyname Type Unique Packed Field Cardinality
PRIMARY BTREE Yes No id 65818
firms_location_name_en BTREE No No name_en 65818
Keys on firms_firmphonenumber:
Keyname Type Unique Packed Field Cardinality
PRIMARY BTREE Yes No id 85088
firms_firmphonenumber_firm_id BTREE No No location_id 85088
It seems (to me) that MySQL refuses to use the firms_location table's primary key - but I have no idea why.
Any help would be much appreciated.
Edit after solution posted
With the altered order by:
EXPLAIN SELECT *
FROM `firms_firmphonenumber`
INNER JOIN `firms_location` ON (
`firms_firmphonenumber`.`location_id` = `firms_location`.`id`
)
ORDER BY
`firms_location`.`name_en` ASC,
`firms_location`.id ASC LIMIT 100;
#`firms_firmphonenumber`.`location_id` ASC LIMIT 100;
Result:
"id","select_type","table","type","possible_keys","key","key_len","ref","rows","Extra"
1,"SIMPLE","firms_location","index","PRIMARY","firms_location_name_en","767","",100,""
1,"SIMPLE","firms_firmphonenumber","ref","firms_firmphonenumber_firm_id","firms_firmphonenumber_firm_id","4","citiadmin.firms_location.id",1,""
Why did it decide to use these now? MySQL makes some odd choices... Any insight would help again :)
Edit with detail from django
Originally, I had these (abbreviated) models:
class Location(models.Model):
    id = models.AutoField(primary_key=True)
    name_en = models.CharField(max_length=255, db_index=True)

    class Meta:
        ordering = ("name_en", "id")

class FirmPhoneNumber(models.Model):
    location = models.ForeignKey(Location, db_index=True)
    number = PhoneNumberField(db_index=True)

    class Meta:
        ordering = ("location", "number")
Changing the Location class's Meta.ordering field to ("name_en",) fixed the query so it no longer has the spurious ORDER BY.

These things tend to be by trial and error, but try ordering on firms_location.id rather than firms_firmphonenumber.location_id. They are the same value, but MySQL may then pick up on the index.

It is using it, for the join; that's the 'citiadmin.firms_location.id' value in the ref column. It isn't appearing in possible_keys and key because you have no WHERE clause and it's only reflecting keys it has available for the ORDER BY clause.
If you want to speed up your query, try indexing name_en.

Because there's no WHERE, and because the cardinality of the join field is higher than that of the joining field, it's calculating that it might as well get everything. Using the index on the join won't speed that up, so it's resorting to the lesser optimization of using an index for sorting.
First, you can use USE INDEX to force it to use the index you specify. Also, try running OPTIMIZE TABLE to make sure the cardinality is correctly estimated. (I'm guessing you're using InnoDB, which estimates it in a series of random "dives"; if this is MyISAM, which tracks it exactly, then I wonder why the cardinality looks as it does.)
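A minimal sketch of both suggestions, using the index name from the key list above (the hint syntax is MySQL-specific):

-- Force the name_en index for the ORDER BY
SELECT *
FROM firms_location USE INDEX (firms_location_name_en)
INNER JOIN firms_firmphonenumber
    ON firms_firmphonenumber.location_id = firms_location.id
ORDER BY firms_location.name_en ASC, firms_location.id ASC
LIMIT 100;

-- Refresh statistics so cardinality estimates are current
OPTIMIZE TABLE firms_location, firms_firmphonenumber;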
Don't bother indexing the name and so on. MySQL will only ever use one index per table per join, and the extra index will just bulk things up.

How much data is there? If only a few rows, most databases will just do a table scan no matter what indexes you have.

Related

Indexing for complex predicates

I'm struggling to identify effective indexes (or rewrite the query) to improve a query with the following confounding predicates:
A JOIN on a date from one table falling in a range between two date fields on the second table (one is nullable; the other is not nullable and part of the PK).
The date used is actually the value in the (nullable) date field + 1.
The WHERE clause includes OR logic across multiple flag fields.
The simplified version of the query is:
select
d.dim_date_id
,f.dim_provider_id
,f.dim_event_id
,d.date
from DWH.dbo.tbl_fact_outcome f
join DWH.dbo.tbl_dim_date d on DATEADD(DAY,1,d.date) between f.known_from and f.known_to
where
f.known_from > getdate()-12
and (d.flag_latest_day = 'Y' or d.flag_end_of_month = 'Y' or (d.flag_end_of_week = 'Y' AND d.flag_latest_week = 'Y'))
and d.flag_future_day = 'N'
and f.deleted = 0
tbl_fact_outcome has these indexes:
PK clustered index on input_form_id, known_from
Non-unique Nonclustered index on deleted, known_from, known_to (INCLUDES the required _dim_id fields)
tbl_dim_date has these indexes:
PK clustered index on dim_date_id
Non-unique nonclustered index on flag_future_day, date (INCLUDES relevant flag fields)
At present, it estimates 853 rows but returns 16,784.
Here is the query plan:
https://www.brentozar.com/pastetheplan/?id=rydKb_3AI
Statistics are up to date.
I have tried re-ordering the columns in the covering indexes, but saw no improvement.
I'm totally stumped as to what else to try with indexes or the code itself to improve performance, so any pointers appreciated.
EDIT 05/07/2020
Ruled out following suggestions here:
Filtered index (on deleted) on tbl_fact_outcome - less than 1% of records would be filtered out, so not worthwhile
Filtered index (using entire WHERE clause from query) on tbl_dim_date - not possible to use OR in index
Index on tbl_dim_date with INCLUDEd fields as key fields - tried this, made no difference, not used by optimizer.
Guessing that all or most queries filter on deleted, I would suggest a filtered index.
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_fact_outcome (
known_from ASC
,known_to ASC
)
INCLUDE (dim_event_id,dim_provider_id)
WHERE deleted = 0;
If this query really runs frequently, you could also consider a filtered index for tbl_dim_date. It will probably only be used by this query, since the filter is an exact match for your WHERE clause:
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_dim_date ([date] ASC)
WHERE (
    flag_latest_day = 'Y'
    OR flag_end_of_month = 'Y'
    OR (
        flag_end_of_week = 'Y'
        AND flag_latest_week = 'Y'
    )
)
AND flag_future_day = 'N'
If you don't want a filtered index on the flag fields, you should add the flag fields to the index as key columns instead of INCLUDEs.
CREATE NONCLUSTERED INDEX TodoNewIndexName ON DWH.dbo.tbl_dim_date (
DATE ASC
,flag_latest_day ASC
,flag_end_of_month ASC
,flag_end_of_week ASC
,flag_latest_week ASC
)
What this should do is get rid of the Index Spool (Eager Spool); for more info about eager spools:
Eager index spools are often a sign that a useful permanent index is
missing from the database schema. This is not always the case, as the
streaming table-valued function examples show.
Does your date dimension have a nextDay column or something similar? If not, you can add such a column and replace DATEADD(DAY,1,d.date) with it.
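A sketch of that change, borrowing the persisted computed column pattern; the column name nextDay is illustrative:

-- Precompute "this date + 1 day" once, on the dimension side
ALTER TABLE DWH.dbo.tbl_dim_date
ADD nextDay AS DATEADD(DAY, 1, [date]) PERSISTED;

-- The join predicate then becomes:
-- join DWH.dbo.tbl_dim_date d on d.nextDay between f.known_from and f.known_to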

Distinct with long columns

I have a database schema with tables that have long fields (of type "text" in both MS SQL Server and Sybase), and I need to retrieve distinct rows.
The tables look like
create table node (id int primary key, … a few more fields … data text);
create table ref (id int primary key, node_id int, … a few more fields);
For one row in "node", there may be zero or more rows in "ref".
Now I have a query like
SELECT node.* FROM node, ref WHERE node.id = ref.node_id AND ... some more restrictions.
This query returns the same row two or three times when there is more than a single row in "ref" for some "node_id".
But I need unique rows!
Using SELECT DISTINCT node.* does not work because of the columns of type "text" :-(
In Sybase there is a trick: just add "GROUP BY node.id" to the query and, voilà, you get unique rows.
Is there a similarly simple trick for MS SQL Server?
I already have a solution with temporary tables, but it seems to be a lot slower; maybe just because of the larger number of statements transferred to the database?
It looks like you are approaching this problem from the wrong direction. Joins are typically used to expand on keys where relevant data is stored in different tables. So it's no surprise you are getting more than one row per node_id.
In your query, you join the two tables together, but then you ignore everything from ref. It looks like you're just trying to filter out ids from node that are not referenced in ref. If that is the case, then you don't want to use a join; the following will work much better:
select *
from node
where id in (
select node_id
from ref
where [any restrictions placed on the ref table go here]
)
and [any restrictions placed on the node table go here]
Furthermore, at the risk of teaching you bad join practices, the same thing can be accomplished the way you were trying to do it originally, but it's more painful to write and it's not good practice:
select node.col1, node.col2, ... , node.last_col
FROM node
inner join ref on node.id = ref.node_id
where [some restrictions.]
group by node.col1, node.col2, ... , node.last_col
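For completeness, the same semi-join can also be written with EXISTS; a sketch, keeping the restriction placeholders from above:

select *
from node n
where exists (
    select 1
    from ref r
    where r.node_id = n.id
    -- and [any restrictions placed on the ref table go here]
)
-- and [any restrictions placed on the node table go here]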

SQL Query is slow when ORDER BY statement added

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated [datetime]
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms.)
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that it's searching another index (DateCreated), but should it really be that much slower? If so, why? Is there anything I can do to speed this query up (a composite index)?
Thanks
BTW: all indexes, including DateCreated, have really low fragmentation; in fact I ran a reorganize and it didn't change a thing.
As far as why the query is slower, the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether each row is to be returned, looking at the values in the Status and Name columns.
If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows, and then perform the sort of just those.
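A sketch of that restructuring, if any 50 matching rows (rather than the 50 earliest) are acceptable:

-- Grab any 50 qualifying rows first, then sort only those 50
SELECT *
FROM (
    SELECT TOP 50 *
    FROM [Documents]
    WHERE (Name = 'None' OR Name IS NULL OR Name = '')
      AND Status = 'New'
) t
ORDER BY t.DateCreated;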
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
CREATE NONCLUSTERED INDEX IX_Documents_Status_DateCreated_Name -- index name is illustrative
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the lookups to pages in the underlying table (needed to get "all" of the columns) to just the rows that will be returned.
For SQL Server 2008 or later, I'd consider a filtered index, dependent on the cardinality of Status='New' (that is, if the rows that satisfy the predicate Status='New' are a relatively small subset of the table):
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name so that the ORDER BY clause matches the index; it doesn't really change the order that the rows are returned in.
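The rewrite would look like this (same predicates; only the ORDER BY changes):

SELECT TOP 50 *
FROM [Documents]
WHERE (Name = 'None' OR Name IS NULL OR Name = '')
  AND Status = 'New'
ORDER BY Status, DateCreated, Name;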
As a more complicated alternative, I would consider adding a persisted computed column and a filtered index on that:
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If the DateCreated field means the time the row was inserted into the table, you can create an integer id field and order by that integer field instead.
You need an index on two columns: (Name, DateCreated). The order of fields in the index is important, so replace your index on just Name with a new index on both columns (Name, DateCreated).
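A sketch of that change; the name of the existing single-column index is hypothetical, so substitute whatever your schema calls it:

DROP INDEX IX_Documents_Name ON [Documents];  -- hypothetical name of the old Name-only index
CREATE NONCLUSTERED INDEX IX_Documents_Name_DateCreated
    ON [Documents] (Name, DateCreated);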

Suitable indexes for sorting in ranking functions

I have a table which keeps parent-child-relations between items. Those can be changed over time, and it is necessary to keep a complete history so that I can query how the relations were at any time.
The table is something like this (I removed some columns and the primary key etc. to reduce noise):
CREATE TABLE [tblRelation](
[dtCreated] [datetime] NOT NULL,
[uidNode] [uniqueidentifier] NOT NULL,
[uidParentNode] [uniqueidentifier] NOT NULL
)
My query to get the relations at a specific time is like this (assume @dt is a datetime variable with the desired date):
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY r.uidNode ORDER BY r.dtCreated DESC) ix, r.*
    FROM [tblRelation] r
    WHERE (r.dtCreated < @dt)
) r
WHERE r.ix = 1
This query works well. However, the performance is not yet as good as I would like. When looking at the execution plan, it basically boils down to a clustered index scan (36% of cost) and a sort (63% of cost).
What indexes should I use to make this query faster? Or is there a better way altogether to perform this query on this table?
The ideal index for this query would have key columns (uidNode, dtCreated) and include all remaining columns in the table, making the index covering, since you are returning r.*. If the query will generally only return a relatively small number of rows (as seems likely due to the WHERE r.ix = 1 filter), it might not be worthwhile making the index covering, though, as the cost of the key lookups might not outweigh the negative effect of the large index on CUD statements.
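A sketch of that index; the name is illustrative, and the INCLUDE list assumes uidParentNode plus whatever columns were trimmed out of the posted DDL:

CREATE NONCLUSTERED INDEX IX_tblRelation_uidNode_dtCreated
    ON tblRelation (uidNode, dtCreated DESC)
    INCLUDE (uidParentNode);  -- add the other returned columns to make it covering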
The window/ranking functions on SQL Server 2005 are sometimes not that optimal (based on answers here); apparently they are better in SQL Server 2008.
Another alternative is something like this. I'd have a non-clustered index on (uidNode, dtCreated), INCLUDEing any other columns required by the SELECT, subject to what Martin Smith said about lookups.
WITH MaxPerUid AS
(
    SELECT
        MAX(r.dtCreated) AS MAXdtCreated, r.uidNode
    FROM
        [tblRelation] r   -- the CTE must read the base table, not itself
    WHERE
        r.dtCreated < @dt
    GROUP BY
        r.uidNode
)
SELECT
    ...
FROM
    MaxPerUid M
JOIN
    [tblRelation] R ON M.uidNode = R.uidNode AND M.MAXdtCreated = R.dtCreated

SQL Server won't use my index

I have a fairly simple query:
SELECT
col1,
col2…
FROM
dbo.My_Table
WHERE
col1 = @col1 AND
col2 = @col2 AND
col3 <= @col3
It was performing horribly, so I added an index on col1, col2, col3 (int, bit, and datetime). When I checked the query plan it was ignoring my index. I tried reordering the columns in the index in every possible configuration and it always ignored the index. When I run the query it does a clustered index scan (table size is between 700K and 800K rows) and takes 10-12 seconds. When I force it to use my index it returns instantly. I was careful to clear the cache and buffers between tests.
Other things I’ve tried:
UPDATE STATISTICS dbo.My_Table
CREATE STATISTICS tmp_stats ON dbo.My_Table (col1, col2, col3) WITH FULLSCAN
Am I missing anything here? I hate to put an index hint in a stored procedure, but SQL Server just can’t seem to get a clue on this one. Anyone know any other things that might prevent SQL Server from recognizing that using the index is a good idea?
EDIT: One of the columns being returned is a TEXT column, so using a covering index or an INCLUDE won't work :(
You have 800k rows indexed by col1, col2, col3. Col2 is a bit, so its selectivity is 50%. Col3 is checked on a range (<=), so its selectivity will be roughly 50% too. Which leaves col1. The query is compiled for the generic, parametrized plan, so it has to account for the general case. If you have 10 distinct values of col1, then your index will return approximately 800k / 10 * 25%, that is, about ~20k keys to look up in the clustered index to retrieve the '...' part. If you have 10k distinct col1 values, then the index will return just 20 keys to look up. As you can see, what matters in this case is not how you build your index, but the actual data. Based on the selectivity of col1, the optimizer will choose a plan based on a clustered index scan (as better than 20k key lookups, each lookup costing at least 3-5 page reads) or one based on the non-clustered index (if col1 is selective enough). In real life the distribution of col1 also plays a role, but going into that would complicate the explanation too much.
You can come in with the benefit of hindsight and claim the plan is wrong, but the plan is the best cost estimate based on the data available at compile time. You can influence it with hints (an index hint as you suggest, or OPTIMIZE FOR hints as Quassnoi suggests), but then your query may perform better for your test set and far worse for a different set of data, say for the case when @col1 = <the value that matches 500k records>. You can also make the index covering, thus eliminating the '...' in the projection list that requires the clustered index lookup, in which case the non-clustered index is always a better cost match than the clustered scan.
Kimberly Tripp has a blog article covering this subject; she calls it the 'index tipping point'. It explains how an apparently perfect candidate index can be ignored: a non-clustered index that does not cover the projection list and has poor selectivity will be seen as more costly than a clustered scan.
The SQL Server optimizer is not good at optimizing queries that use variables.
If you are sure that you will always benefit from using the index, just put a hint.
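For reference, a forced-index sketch; the index name is hypothetical, so substitute the one you created:

SELECT col1, col2
FROM dbo.My_Table WITH (INDEX (IX_My_Table_col1_col2_col3))  -- hypothetical index name
WHERE col1 = @col1
  AND col2 = @col2
  AND col3 <= @col3;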
If you will put the literal values to the query instead of variables, it will pick the correct statistics and will use the index.
You may also try a lighter hint:
OPTION (OPTIMIZE FOR (@col1 = 1, @col2 = 0, @col3 = '2009-07-09'))
which will calculate the best execution plan for these values of the variables, using statistics, and won't stick to using the index no matter what.
The order of the index is important for this query:
CREATE INDEX MyIndex ON MyTable (col3 DESC, col2 ASC, col1 ASC)
It's not so much the ASC/DESC as that when SQL Server goes to match that WHERE clause, it can match on col3 first and walk the index along that value.
Have you tried tossing out the bit from the index?
create index ix1 on My_Table(Col3, Col1) INCLUDE(Col2)
-- include other columns from the select list if needed
Also, you've left out the rest of the columns from the select list. If there aren't many, you might want to consider including those, either in the index itself or in an INCLUDE clause, to create a covering index for the query.
Try masking your parameters to prevent parameter sniffing:
CREATE PROCEDURE MyProc
    @Col1 INT
    -- etc...
AS
    DECLARE @MaskedCol1 INT
    SET @MaskedCol1 = @Col1
    -- etc...
    SELECT
        col1,
        col2…
    FROM
        dbo.My_Table
    WHERE
        col1 = @MaskedCol1 AND
        -- etc...
Sounds stupid, but I've seen SQL Server do some weird things because of parameter sniffing.
I bet SQL Server thinks the price of getting the rest of the columns (designated by ... in your example) from the clustered index outweighs the benefit of the index, so it just scans the clustered key. If so, see if you can make this a covering index.
Or does it use another index instead?
Are the columns nullable? Sometimes SQL Server thinks it has to scan the table to find NULL values.
Try adding "and col1 is not null" to the query; it might make SQL Server use the index without a hint.
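As a quick sketch, that just means tacking the redundant predicate onto the original query:

SELECT col1, col2
FROM dbo.My_Table
WHERE col1 = @col1
  AND col2 = @col2
  AND col3 <= @col3
  AND col1 IS NOT NULL;  -- redundant with col1 = @col1, but can nudge the optimizer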
Also, check if the statistics are really up to date:
SELECT
    object_name = OBJECT_NAME(ind.object_id),
    IndexName = ind.name,
    StatisticsDate = STATS_DATE(ind.object_id, ind.index_id)
FROM sys.indexes ind
ORDER BY STATS_DATE(ind.object_id, ind.index_id) DESC
If your SELECT is returning columns that aren't in your index, SQL Server may find that it's more efficient to scan the clustered index instead of having to do a key lookup to find the other values that you are requesting.
If you have a TEXT column, try switching the data type to VARCHAR(MAX) and then including the values in the nonclustered index.
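A sketch of that conversion; the LOB column name text_col is hypothetical:

-- Convert the legacy TEXT column so it is allowed in an INCLUDE list
ALTER TABLE dbo.My_Table ALTER COLUMN text_col VARCHAR(MAX);  -- hypothetical column name

-- A covering index then becomes possible
CREATE NONCLUSTERED INDEX IX_My_Table_covering
    ON dbo.My_Table (col1, col2, col3)
    INCLUDE (text_col);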
