The Query Optimizer is estimating that the results of a join will have only one row, when the actual number of rows is 2000. This is causing later joins on the dataset to have an estimated result of one row, when some of them go as high as 30,000.
With an estimate of 1 row, the QO is choosing a loop join/index seek strategy for many of the joins, which is much too slow. I worked around the issue by constraining the possible join strategies with an OPTION (HASH JOIN, MERGE JOIN) hint, which improved overall execution time from 60+ minutes to 12 seconds. However, I think the QO is still generating a less-than-optimal plan because of the bad row counts. I don't want to specify the join order and details manually; there are too many affected queries for that to be worthwhile.
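The workaround looks roughly like this (the table and column names here are simplified placeholders, not the real schema):

SELECT p.PersonId, sp.SpecialtyData
FROM dbo.Person AS p
INNER JOIN dbo.SpecialPerson AS sp ON sp.PersonId = p.PersonId
OPTION (HASH JOIN, MERGE JOIN); -- rules out loop joins; the QO still chooses between hash and merge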
This is in Microsoft SQL Server 2000. It's a medium-sized query with several table selects joined to the main select.
I think the QO may be misjudging the selectivity of the join, assuming the joining columns of the two tables have fewer rows in common than they actually do.
The estimated row counts from scanning the indexes before the join are accurate; it's only the estimated row count after certain joins that's much too low.
The statistics for all the tables in the DB are up to date and refreshed automatically.
One of the early bad joins is between a generic 'Person' table for information common to all people and a specialized person table that about 5% of all those people belong to. The clustered PK in both tables (and the join column) is an INT. The database is highly normalized.
I believe that the root problem is the bad row count estimate after certain joins, so my main questions are:
How can I fix the QO's post join rowcount estimate?
Is there a way that I can hint that a join will have a lot of rows without specifying the entire join order manually?
Although the statistics were up to date, the sampling percentage wasn't high enough to provide accurate information. I ran the following on each of the base tables that was having a problem, to update all of the statistics on the table by scanning all the rows rather than just the default percentage:
UPDATE STATISTICS <table> WITH FULLSCAN, ALL
The query still has a lot of loop joins, but the join order is different and it runs in 2-3 seconds.
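To check whether the default sample really is the culprit, the statistics header reports the total row count next to the number of rows actually sampled; if Rows Sampled is far below Rows, a FULLSCAN update is worth trying. The object and index names below are only examples:

DBCC SHOW_STATISTICS ('dbo.Person', PK_Person)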
Can't you prod the QO with a well-placed query hint?
I have a table that receives approximately 450,000 records per month. It is an audit table of sorts that tracks changes to other tables in the database, that is, inserts, updates and deletes of records. Typically this table is not queried (perhaps only 2-3 times per month, to examine how data in other tables changed, and under very specific circumstances).
It has been put to me that we should consider partitioning this table to help improve database performance. If the table is only being inserted into 99.9% of the time and rarely queried, would there be any tangible benefit to partitioning it?
Thanks.
If the table is only being inserted into 99.9% of the time and rarely queried, would there be any tangible benefit to partitioning it?
Partitioning is mostly a manageability feature. I would expect no difference in insert performance with or without table partitioning. For SELECT queries, partitioning may improve performance of large scans if partitions can be eliminated (i.e. the partitioning column is specified in the WHERE clause), but indexing and query tuning are usually the key to performance.
Partitioning can improve performance of purge operations. For example, you could use a monthly sliding window to purge an entire month of data at once rather than deleting individual rows. I don't know if that's worth the trouble with only 450K rows/month, though.
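A rough sketch of such a monthly sliding-window purge, assuming the audit table is partitioned by month on a datetime column (the object names and the boundary date are hypothetical):

-- 1. Switch the oldest partition out to an empty table with an identical structure
ALTER TABLE dbo.AuditLog SWITCH PARTITION 1 TO dbo.AuditLog_Purge;

-- 2. Throw away the switched-out month in one cheap operation
TRUNCATE TABLE dbo.AuditLog_Purge;

-- 3. Remove the now-empty boundary so the window keeps sliding
ALTER PARTITION FUNCTION pfAuditByMonth() MERGE RANGE ('20110101');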
I think you want to get fast access to your recent data.
Add a date column as the first column of the clustered primary key instead of partitioning.
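A minimal sketch of that idea (all names invented): with the date leading the clustered key, the newest rows are stored together and queries for recent data become range scans.

CREATE TABLE dbo.AuditLog
(
AuditDate datetime NOT NULL,
AuditId int IDENTITY(1,1) NOT NULL,
TableName sysname NOT NULL,
Operation char(1) NOT NULL, -- 'I', 'U' or 'D'
CONSTRAINT PK_AuditLog PRIMARY KEY CLUSTERED (AuditDate, AuditId)
)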
I want some help regarding join processing:
Nested Loop Join
Block Nested loop join
Merge join
Hash join
I searched but did not find a link that also provides mathematical examples of the calculations, e.g.:
Consider the natural join of relations R and S, with the following information about those relations:
Relation R contains 8,000 records and has 10 records per page
Relation S contains 2,000 records and has 10 records per page
Both relations are stored as sorted files on the join attribute
How many disk operations would it take to process this join using each of the four methods above?
Do you have a specific DBMS in mind?
For Oracle, you'd have to know the block size, the setting of db_file_multiblock_read_count, the expected number of blocks already in cache, the high-water mark for each table, and the existing indexes and their clustering factor, to mention a few of the things that will affect the answer.
As a general rule, whenever I join two tables in full, I expect to see two full table scans and a hash join. Whenever I join parts of two tables, I expect to see a nested loop driven from the table with the most selective filter predicate.
Whenever I get surprised, I investigate the statistics and the above-mentioned things to validate the optimizer's choice.
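As a purely textbook-style approximation for the example above, counting page I/Os only and ignoring all of the physical factors just mentioned: R occupies 8,000 / 10 = 800 pages and S occupies 2,000 / 10 = 200 pages, so roughly:

Sort-merge join (both inputs already sorted on the join attribute): 800 + 200 = 1,000 page I/Os
Hash join: 800 + 200 = 1,000 page I/Os if the smaller relation fits in memory, or about 3 × (800 + 200) = 3,000 with a Grace-style partitioning pass
Block nested loop join (S as the outer, one buffer page per relation): 200 + 200 × 800 = 160,200 page I/Os
Simple tuple-at-a-time nested loop join (S as the outer): 200 + 2,000 × 800 = 1,600,200 page I/Os

With B buffer pages, the block nested loop figure drops to about 200 + ceil(200 / (B - 2)) × 800.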
I have two tables, one with about 1,500 records and the other with about 300,000 child records, roughly a 1:200 ratio. I stage the parent table to a staging table, SomeParentTable_Staging, and then I stage all of its child records, but I only want the ones that are related to the records I staged in the parent table. So I use the query below to perform this staging by joining against the parent table's staged data.
--Stage child records
INSERT INTO [dbo].[SomeChildTable_Staging]
([SomeChildTableId]
,[SomeParentTableId]
,SomeData1
,SomeData2
,SomeData3
,SomeData4
)
SELECT [SomeChildTableId]
,D.[SomeParentTableId]
,SomeData1
,SomeData2
,SomeData3
,SomeData4
FROM [dbo].[SomeChildTable] D
INNER JOIN dbo.SomeParentTable_Staging I ON D.SomeParentTableID = I.SomeParentTableID;
The execution plan indicates that the tables are being joined with a Nested Loop. When I run just the SELECT portion of the query without the INSERT, the join is performed with a Hash Match. So the SELECT statement is the same, but in the context of an INSERT it uses the slower nested loop. I have added a non-clustered index on D.SomeParentTableID so that there is an index on both sides of the join; I.SomeParentTableID is a primary key with a clustered index.
Why does it use a nested loop for inserts that use a join? Is there a way to improve the performance of the join for the insert?
A few thoughts:
Make sure your statistics are up to date. Bad statistics account for many of the bizarre "intermittent" query plan problems.
Make sure your indexes are covering, otherwise there's a much higher probability of the optimizer ignoring them.
If none of that helps, you can always force a specific join by writing INNER HASH JOIN instead of just INNER JOIN, as in the sketch below.
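For example, the staging insert from the question with the join algorithm pinned (same tables and columns as above):

INSERT INTO [dbo].[SomeChildTable_Staging]
([SomeChildTableId], [SomeParentTableId], SomeData1, SomeData2, SomeData3, SomeData4)
SELECT [SomeChildTableId], D.[SomeParentTableId], SomeData1, SomeData2, SomeData3, SomeData4
FROM [dbo].[SomeChildTable] D
INNER HASH JOIN dbo.SomeParentTable_Staging I ON D.SomeParentTableID = I.SomeParentTableID;

Bear in mind that a join hint in the FROM clause also fixes the join order for the statement; appending OPTION (HASH JOIN) to the end of the INSERT is a slightly less intrusive alternative that applies to every join in the query.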
Does the destination table have a clustered index? The choice of join may be necessary to facilitate the ordering of the data in the insert. I've seen execution plans differ depending on whether the destination table has a clustered index and what column(s) it is on.
As per the subject, I am looking for a fast way to count the records in a table, with a WHERE condition, without a table scan.
There are different methods; the most reliable one is
Select count(*) from table_name
But other than that, you can also use one of the following:
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last two need sysindexes to be up to date; run the following to achieve this. If you don't update it, it is highly likely they'll give you wrong results, but as an approximation they might actually work.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: Sorry, I did not read the part about counting with a certain clause. I agree with Cruachan: the solution to your problem is proper indexes.
The following page lists 4 methods of getting the number of rows in a table, with commentary on accuracy and speed:
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
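On SQL Server 2005 and later, a similar approximate count can also be read from the sys.dm_db_partition_stats DMV without touching the table itself (the table name is the same example as above):

SELECT SUM(p.row_count)
FROM sys.dm_db_partition_stats AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Transactions')
AND p.index_id < 2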
Simply ensure that your table is correctly indexed for the WHERE condition.
If you're concerned about this sort of performance, the approach is to create indexes which incorporate the field in question. For example, if your table contains a primary key of foo plus fields bar, parrot and shrubbery, and you know that you're going to need to pull back records regularly using a condition on shrubbery that only needs data from that field, you should set up a compound index of [shrubbery, foo]. This way the RDBMS only has to query the index and not the table. Indexes, being tree structures, are far faster to query against than the table itself.
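A sketch of that compound index and the kind of count it serves (the table name is a stand-in; the column names come from the example above):

CREATE INDEX IX_MyTable_shrubbery_foo ON dbo.MyTable (shrubbery, foo);

-- Answered entirely from the index; the table data itself is never read
SELECT COUNT(*)
FROM dbo.MyTable
WHERE shrubbery = 'hedge';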
How much actual activity the RDBMS needs depends on the RDBMS itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table with no WHERE condition will, on most RDBMSs, return instantly, as the record count is held at the table level and a table scan is not required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead: if you update a field, the RDBMS has to update every index containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is reads and insert/update/delete activity is of lesser importance, and which are heavily indexed on various combinations of table fields so that most queries just use the indexes and never touch the actual table data.
ADDED: If you are using indexed access on a table that does have significant insert/update/delete activity, then make sure you schedule regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant insert/update/delete activity periodic maintenance is needed to keep them that way.
Hypothetically, in a SQL Server database, if I have a table with two int fields (say a many-to-many relation) that participates in joins between two other tables, at approximately what size does the table become large enough that the performance benefit of indexes on the two int fields outweighs the overhead imposed by those indexes?
Are there differences in architecture between different versions of SQL Server that would substantially change this answer?
For queries involving a small portion of the table's rows, indexes are always beneficial, whether there are 100 rows or 1,000,000.
See this entry in my blog for examples with plans and performance details:
Indexing tiny tables
Queries like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
will most probably use HASH JOIN. A hash table for the smaller table will be built, and the rows from the larger table will be used to probe the hash table.
To do this, no index is needed.
However, this query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
WHERE t1.othercol = @value
will use NESTED LOOPS: the rows from the outer table (table1) will be searched using an index on table1.othercol, and the rows from the inner table (table2) will be searched using an index on table2.col.
If you don't have an index on table2.col, a HASH JOIN will be used instead, which requires scanning all rows from both tables and some more resources to build a hash table.
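For reference, the indexes that make the NESTED LOOPS plan above attractive, using the same placeholder table and column names:

CREATE INDEX IX_table1_othercol ON table1 (othercol);
CREATE INDEX IX_table2_col ON table2 (col);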
Indexes are also useful for queries like this:
SELECT t2.col
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
in which case the engine doesn't need to read table2 itself at all: everything you need for this query can be found in the index, which can be much smaller than the table itself and more efficient to read.
And, of course, if you need your data sorted and have indexes on both table1.col and table2.col, then the following query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
ORDER BY
t2.col
will probably use the MERGE JOIN method, which is super fast if both input rowsets are sorted, and its output is also sorted, which means the ORDER BY comes for free.
Note that even if you don't have an index, the optimizer may choose to Eager Spool your small table, which means building a temporary index for the duration of the query and dropping it once the query finishes.
If the query is small, it will be very fast, but again, an index won't hurt (for SELECT queries, I mean). If the optimizer doesn't need it, it just won't be used.
Note, though, that creating an index may affect DML performance, but that's another story.
It depends on the selectivity of your data. If your data is not selective enough, the index might not even be used, since the cost would be too high. If you have only 2 values in the table and those values are evenly distributed, you will get a scan, not a seek.
I still believe every table should have a Primary Key; if you have that, then you already have an index.
The penalty for insertion will be negligible until long after the benefit of the indexes appears. The optimizer is smart enough to ignore the indexes anyway until that point kicks in. So just index the table from the start.
Regardless of size, there is always a performance benefit to using an index when doing a lookup.
Regarding overhead, the question becomes: what overhead do you mean, and how do you relate it to the value of a lookup? The two are separate values, after all.
There are two forms of overhead for an index: space (which is usually negligible, depending on how the index is structured), and re-index on insert (the server must recalculate an index after every insert).
As I mentioned, the space issue probably isn't that big a deal. But re-indexing is. Fortunately, you need to be doing lots of near-continuous inserting before that form of overhead becomes a problem.
So bottom line: You're almost always better off having an index. Start from that position and wait until re-indexing becomes a bottleneck. Then you can look into alternatives.
The index will nearly always increase the performance of the query, at the cost of extra memory and performance cost for insert/deletion (since it needs to maintain the index at that point). Profiling will be the only definite way to tell whether or not the index, in your particular case, is beneficial.
In general, you're trading memory for speed when you create an index (other than the additional cost of insertion). If you're doing many queries (selects or updates) relative to the number of inserted/deleted rows, indexes will pretty much always increase your performance.
Another thing to think about is the concept of coding performance: sometimes having an index can streamline the mental overhead of thinking about how to manage the relationship between different pieces of data, and sometimes it can complicate it...
A very useful link:
"The Tipping Point Query Answers"
http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Tipping-Point-Query-Answers.aspx
The best thing is to let the server figure it out itself. You create indexes on the columns where it makes sense (I'm sure there are entire chapters, if not books, on how to do this the best way), and let SQL Server figure out when and how to use them.
In many cases, when optimizing, you'll need to read the docs of your particular DBMS to learn more about how it uses indexes, and relate that to the queries your application runs. Then you can fine-tune the index usage.
I believe that as soon as you start doing joins on those int fields, your table is big enough. If the table is small enough that it wouldn't benefit from an index, the overhead wouldn't be significant enough that you'd want to opt out.
When I think about the overhead due to an index I usually consider how often the table index will be changing--through inserts, deletes and updates to indexed columns.