I want some help regarding join processing
Nested Loop Join
Block Nested loop join
Merge join
Hash join
I searched but did not find any link that also provides worked mathematical examples of the calculations.
e.g.
Consider the natural join R ⋈ S of relations R and S, with the following information about those relations:
Relation R contains 8,000 records and has 10 records per page
Relation S contains 2,000 records and has 10 records per page
Both relations are stored as sorted files on the join attribute
How many disk operations would it take to process the join using each of the four methods above?
Do you have a specific DBMS in mind?
For Oracle, you'd have to know the block size, the configuration for db_file_multiblock_read_count, the expected number of blocks already in the cache, the high watermark for each table, and the existing indexes and their clustering factor, to mention just a few of the things that will affect the answer.
As a general rule, whenever I fully join two tables, I expect to see two full table scans and a hash join. Whenever I join parts of two tables, I expect to see a nested loop driven from the table with the most selective filter predicate.
Whenever I get surprised, I investigate the statistics and the above-mentioned factors to validate the optimizer's choice.
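If all you want are the classic textbook page-I/O estimates for the example above, here is a sketch using the standard cost formulas (B(X) = number of pages of X, |X| = number of records; it ignores caching, indexes and the cost of writing the result, i.e. exactly the factors listed above):
B(R) = 8,000 / 10 = 800 pages, B(S) = 2,000 / 10 = 200 pages; take the smaller relation S as the outer input
Simple nested loop (tuple at a time): B(S) + |S| * B(R) = 200 + 2,000 * 800 = 1,600,200 page I/Os
Block nested loop with B buffer pages: B(S) + ceil(B(S) / (B - 2)) * B(R); with only 3 buffers that is 200 + 200 * 800 = 160,200
Merge join (both files already sorted on the join attribute): B(R) + B(S) = 800 + 200 = 1,000
Hash join (two-pass Grace hash): 3 * (B(R) + B(S)) = 3,000; if S fits in memory, a single pass costs B(R) + B(S) = 1,000
These numbers only rank the algorithms against each other; as noted above, a real optimizer's I/O counts depend on buffering, multiblock reads and existing indexes.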
Related
In a report I have the next join from a FACT table:
Join…
LEFT JOIN DimState AS s
ON s.StateCode = l.Province AND l.Locale LIKE (s.CountryCode + '%')
More information:
Fact table has 59,567,773 rows
L.Province can match a StateCode in DimState: 42,346,471 rows 71%
L.Province can’t match a StateCode in DimState: 13,742,966 rows 23% (most of them are a blank value in L.Province).
L.Province is NULL in 3,500,000 rows (6%)
4 questions:
-The correct thing to do would be to replace L.Province NULLs and blanks with “Other”, and have an entry in DimState with StateCode “Other”, right?
-Is it acceptable to LEFT JOIN to a dimension? Or should it always be an INNER JOIN?
-Is it correct to join to a dimension on 2 columns?
-To do a l.Locale = s.CountryCode… Should I modify the values in l.Locale or in s.CountryCode?
In order of your four questions:
Yes, you should not have blanks for dimension keys in your fact tables. If the value in the source data is in fact null or empty, there should be members in your dimension tables which are set aside to reflect this.
Therefore, building off 1, you should GENERALLY not do left joins when joining facts to dimensions. I say generally because there might be a situation where this is necessary, but I can't think of one off the top of my head. You should not have to with properly designed fact and dimension tables.
Generally, no. I would recommend using a surrogate key in this case since your business key is spread across two columns.
Not sure what you are asking here. If you keep this design, you would need to change both. If you switch to using a surrogate key for DimState, you would only have to update the dimension table whenever anything changes.
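As a rough illustration of point 3, a surrogate-key version of DimState could look something like this (everything except StateCode and CountryCode is a made-up name, and the column sizes are assumptions):
CREATE TABLE DimState (
    StateKey    int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key referenced by the fact table
    StateCode   varchar(10) NOT NULL,
    CountryCode varchar(10) NOT NULL
);
-- During ETL the fact row is assigned its StateKey once, so the report joins on a single
-- int column instead of StateCode plus a LIKE on CountryCode.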
To build on what mallan1121 said:
1: There are generally three different meanings for null/blank in data warehousing.
A. I don't know the value
B. The value is known and it is blank
C. The value does not apply.
Make sure you consider the relevance for each option as you design your warehouse. The fact should ALWAYS reference a dimension key or you will end up with data quality issues.
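For example, a sketch of dedicated dimension members for those three cases (the key values are made up, and StateKey is assumed to be a plain int key; wrap the inserts in SET IDENTITY_INSERT if it is an identity column):
INSERT INTO DimState (StateKey, StateCode, CountryCode) VALUES (-1, 'Unknown', 'Unknown'); -- A: value not known
INSERT INTO DimState (StateKey, StateCode, CountryCode) VALUES (-2, 'Blank', 'Blank');     -- B: value known and blank
INSERT INTO DimState (StateKey, StateCode, CountryCode) VALUES (-3, 'N/A', 'N/A');         -- C: value does not apply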
2: It can be useful to use left joins if you are abstracting your tables from your cube using views (a good idea) and if you may use those views for non-cube reporting. The reason is that an inner join is a filtering join and the result set is filtered by all inner joined tables even if only a single column is returned.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
JOIN DimA ON DimA.DimAKey = Fact.DimAKey -- join key names here are illustrative
JOIN DimB ON DimB.DimBKey = Fact.DimBKey --filters result
JOIN DimC ON DimC.DimCKey = Fact.DimCKey --filters result
If you use a left join and you only want columns from some of the tables, the other joins are ignored and those tables are never accessed.
SELECT DimA.COLUMN, Fact.COLUMN
FROM Fact
LEFT JOIN DimA ON DimA.DimAKey = Fact.DimAKey
LEFT JOIN DimB ON DimB.DimBKey = Fact.DimBKey --ignored
LEFT JOIN DimC ON DimC.DimCKey = Fact.DimCKey --ignored
This can speed up reporting queries run directly against the SQL database. However, you must make sure your ETL process enforces the integrity and that the results returned are identical whether inner or left joins are used.
4: Requiring multiple columns in the join is not a problem, but I'd be very concerned about a multiple column join using a wildcard. I expect you have a granularity issue in your dimension. I don't know your data, but using a wildcard risks getting multiple values back from that dimension.
Do not do this, for one simple reason: you will get 13M fact records with the key L.Province = 'Other', and every one of them will be joined with whatever 'Other' records exist in the dimension table; if that key is not unique in the dimension, this leads to massive duplication of the measures.
The proper answer is to enforce the primary key on your dimension. Typically a dimension has one record with the key Other (meaning the key is not known) and possibly one more record NA (the dimension has no meaning for this fact record).
The problem is not in the OUTER join; what should be enforced by design is that every foreign key in the fact table is defined in the dimension table.
One step to achieve this is the definition of the NA and Other members as described in 1.
The rationale behind this approach is to enforce that INNER and OUTER joins lead to the same result, i.e. do not cause confusion with different results.
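One way to enforce that at the database level, assuming the fact table (called FactSales here, a made-up name) already stores the dimension's surrogate key:
-- Every fact row must reference an existing DimState row,
-- so INNER and LEFT joins to the dimension return the same row count.
ALTER TABLE FactSales
    ADD CONSTRAINT FK_FactSales_DimState
    FOREIGN KEY (StateKey) REFERENCES DimState (StateKey);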
Again, each dimension should have a PRIMARY KEY defined; if the PK consists of two columns, a join on those columns is fine. (The typical scenario in a DWH, though, is a single-column numeric PK.)
What should be avoided is a join on LIKE or SUBSTR; this is a sign that the dimension PK is not well defined.
If your dimension has a two-column PK (Locale + Province), the fact table must be updated to contain these two columns as a FK.
In the full text search page http://msdn.microsoft.com/en-us/library/ms189760.aspx on MSDN it says that if you want to do a full text search on multiple tables just "use a joined table in your FROM clause to search on a result set that is the product of two or more tables."
My question is, isn't this going to be really slow if you have to merge two very large tables?
If I'm merging a product table with a category table and there are millions of records, won't the join take a long time and then have to search after the join?
Joins on millions of records can still be fast if the join is optimized for performance, for example, a single int column that is indexed in both tables. But there can be other factors at play so the best approach is to try it and gauge the performance yourself.
If the join doesn't perform well, you have a couple of options:
Create a view of the tables joined together, create a full text index on that view, and run your full text queries against that view (see the sketch after these options).
Create a 3rd table which is a combination of the 2 tables you are joining, create a full text index on it, and run your full text queries against that table. You'll need something like an ETL process to keep it updated.
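A rough sketch of the first option (the table, column and catalog names are made up, and it assumes a full-text catalog already exists and that each product maps to exactly one category):
CREATE VIEW dbo.vProductSearch WITH SCHEMABINDING AS
    SELECT p.ProductID, p.ProductName, c.CategoryName
    FROM dbo.Product AS p
    JOIN dbo.Category AS c ON c.CategoryID = p.CategoryID;
GO
-- Full-text indexing a view requires a unique clustered index on it
CREATE UNIQUE CLUSTERED INDEX IX_vProductSearch ON dbo.vProductSearch (ProductID);
GO
CREATE FULLTEXT INDEX ON dbo.vProductSearch (ProductName, CategoryName)
    KEY INDEX IX_vProductSearch ON ftProductCatalog; -- ftProductCatalog: an existing full-text catalog
GO
-- Queries then search the pre-joined view directly:
SELECT ProductID FROM dbo.vProductSearch WHERE CONTAINS((ProductName, CategoryName), 'widget');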
As per the subject, I am looking for a fast way to count records in a table, with a WHERE condition, without a table scan.
There are different methods, the most reliable one is
Select count(*) from table_name
But other than that you can also use one of the followings
select sum(1) from table_name
select count(1) from table_name
select rows from sysindexes where object_name(id)='table_name' and indid<2
exec sp_spaceused 'table_name'
DBCC CHECKTABLE('table_name')
The last 2 need sysindexes to be up to date; run the following to achieve this. If you don't update it, they will very likely give you wrong results, but as an approximation they might still work.
DBCC UPDATEUSAGE ('database_name','table_name') WITH COUNT_ROWS
EDIT: sorry, I did not read the part about counting with a WHERE clause. I agree with Cruachan: the solution to your problem is proper indexes.
The following page lists 4 methods of getting the number of rows in a table, with commentary on accuracy and speed.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx
This is the one Management Studio uses:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
Simply, ensure that your table is correctly indexed for the where condition.
If you're concerned about this sort of performance, the approach is to create indexes which incorporate the field in question. For example, if your table contains a primary key foo and fields bar, parrot and shrubbery, and you know that you're regularly going to need to pull back records using a condition on shrubbery that needs data only from that field, you should set up a compound index of [shrubbery, foo]. This way the RDBMS only has to query the index and not the table. Indexes, being tree structures, are far faster to query against than the table itself.
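A sketch of that setup, using a made-up table name for the example above:
-- Covering index: the count below is answered from the index alone, without touching the table
CREATE INDEX IX_MyTable_shrubbery_foo ON MyTable (shrubbery, foo);

SELECT COUNT(*) FROM MyTable WHERE shrubbery = 'hedge';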
How much actual activity the RDBMS needs depends on the RDBMS itself and precisely what information it puts into the index. For example, a SELECT COUNT(*) on an unindexed table without a WHERE condition will, on most RDBMSs, return instantly, as the record count is held at the table level and a table scan is not required. Analogous considerations may hold for index access.
Be aware that indexes do carry a maintenance overhead in that if you update a field the rdbms has to update all indexes containing that field too. This may or may not be a critical consideration, but it's not uncommon to see tables where most activity is read and insert/update/delete activity is of lesser importance which are heavily indexed on various combinations of table fields such that most queries will just use the indexes and not touch the actual table data itself.
ADDED: If you are using indexed access on a table that does have significant IUD activity, then just make sure you are scheduling regular maintenance. Tree structures, i.e. indexes, are most efficient when balanced, and with significant IUD activity periodic maintenance is needed to keep them that way.
Hypothetically, in a SQL Server database, if I have a table with two int fields (say a many-to-many relation) that participates in joins between two other tables, at what approximate size does the table become large enough where the performance benefit of indexes on the two int fields overcomes the overhead imposed by said indexes?
Are there differences in architecture between different versions of SQL Server that would substantially change this answer?
For the queries involving small portions of the table rows, indexes are always beneficial, be there 100 rows or 1,000,000.
See this entry in my blog for examples with plans and performance details:
Indexing tiny tables
The queries like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
will most probably use HASH JOIN. A hash table for the smaller table will be built, and the rows from the larger table will be used to probe the hash table.
To do this, no index is needed.
However, this query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
WHERE t1.othercol = @value
will use NESTED LOOPS: the rows from the outer table (table1) will be searched using an index on table1.othercol, and the rows from the inner table (table2) will be searched using an index on table2.col.
If you don't have an index on table2.col, a HASH JOIN will be used, which requires scanning all rows from both tables and some more resources to build a hash table.
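The supporting indexes for that NESTED LOOPS plan would look something like this (the index names are made up):
CREATE INDEX IX_table1_othercol ON table1 (othercol); -- lets the filter on t1.othercol seek instead of scan
CREATE INDEX IX_table2_col ON table2 (col);           -- lets the inner side of the loop seek on the join column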
Indexes are also useful for the queries like this:
SELECT t2.col
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
, in which case the engine doesn't need to read table2 itself at all: everything you need for this query can be found in the index, which can be much smaller than the table itself and more efficient to read.
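The index needed here is the same single-column one as above (name made up); what matters is that it alone can satisfy the query:
CREATE INDEX IX_table2_col ON table2 (col); -- covers the query: t2.col is the only column read from table2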
And, of course, if you need your data sorted and have indexes on both table1.col and table2.col, then the following query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
ORDER BY
t2.col
will probably use the MERGE JOIN method, which is super fast if both input rowsets are sorted, and its output is also sorted, which means the ORDER BY comes for free.
Note that even if you don't have an index, the optimizer may choose to Eager Spool your small table, which means building a temporary index for the duration of the query and dropping it after the query finishes.
If the table is small, this will be very fast, but again, an index won't hurt (for SELECT queries, I mean). If the optimizer doesn't need it, it just won't be used.
Note, though, that creating an index may affect DML performance, but that's another story.
It depends on the selectivity of your data: if your data is not selective enough, then the index might not even be used, since the cost would be too high. If you have only 2 values in the table and these values are evenly distributed, then you will get a scan, not a seek.
I still believe every table should have a primary key; if you have that, then you already have an index.
The penalty for insertion will be negligible until long after the benefit of the indexes appears. The optimizer is smart enough to ignore the indexes anyway until that point kicks in. So just index the table from the start.
Regardless of size, there is always a performance benefit to using an index when doing a lookup.
Regarding overhead, the question becomes: what overhead do you mean, and how do you relate it to the value of a lookup? The two are separate values, after all.
There are two forms of overhead for an index: space (which is usually negligible, depending on how the index is structured), and re-index on insert (the server must recalculate an index after every insert).
As I mentioned, the space issue probably isn't that big a deal. But re-indexing is. Fortunately, you need to be doing lots of near-continuous inserting before that form of overhead becomes a problem.
So bottom line: You're almost always better off having an index. Start from that position and wait until re-indexing becomes a bottleneck. Then you can look into alternatives.
The index will nearly always increase the performance of the query, at the cost of extra memory and performance cost for insert/deletion (since it needs to maintain the index at that point). Profiling will be the only definite way to tell whether or not the index, in your particular case, is beneficial.
In general, you're trading memory for speed when you create an index (other than the additional cost of insertion). If you're doing many queries (selects or updates) relative to the number of inserted/deleted rows, indexes will pretty much always increase your performance.
Another thing to think about is the concept of coding performance: sometimes having an index can streamline the mental overhead of thinking about how to manage the relationship between different pieces of data; sometimes it can complicate it...
A very useful link:
"The Tipping Point Query Answers"
http://www.sqlskills.com/BLOGS/KIMBERLY/post/The-Tipping-Point-Query-Answers.aspx
The best thing is to let the server itself figure it out. You create indexes on the columns where it makes sense (I'm sure there are entire chapters, if not books, on how to do this the best way), and let SQL Server figure out when and how to use them.
In many cases, when optimizing, you'd need to read the docs of your particular DBMS to learn more about how it uses indexes, and relate that to the queries the application you're optimizing uses. Then you can fine-tune the index usage.
I believe as soon as you start doing joins on those int fields your table is big enough. If the table is small enough that it wouldn't benefit from an index then the overhead wouldn't be significant enough that you would want to opt out.
When I think about the overhead due to an index I usually consider how often the table index will be changing--through inserts, deletes and updates to indexed columns.
The Query Optimizer is estimating that the results of a join will have only one row, when the actual number of rows is 2000. This is causing later joins on the dataset to have an estimated result of one row, when some of them go as high as 30,000.
With a count of 1, the QO is choosing a loop join/index seek strategy for many of the joins which is much too slow. I worked around the issue by constraining the possible join strategies with a WITH OPTION (HASH JOIN, MERGE JOIN), which improved overall execution time from 60+ minutes to 12 seconds. However, I think the QO is still generating a less than optimal plan because of the bad rowcounts. I don't want to specify the join order and details manually-- there are too many queries affected by this for it to be worthwhile.
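For reference, the query-level hint looks roughly like this (the table and column names are made up; the real queries are of course much larger):
SELECT p.PersonID, sp.SomeColumn
FROM Person AS p
JOIN SpecializedPerson AS sp ON sp.PersonID = p.PersonID
OPTION (HASH JOIN, MERGE JOIN); -- removes nested loop joins from the optimizer's choices for this query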
This is in Microsoft SQL Server 2000, a medium query with several table selects joined to the main select.
I think the QO may be overestimating the cardinality of the many side of the join, expecting the joining columns between the tables to have fewer rows in common.
The estimated row counts from scanning the indexes before the join are accurate, it's only the estimated row count after certain joins that's much too low.
The statistics for all the tables in the DB are up to date and refreshed automatically.
One of the early bad joins is between a generic 'Person' table for information common to all people and a specialized person table that about 5% of all those people belong to. The clustered PK in both tables (and the join column) is an INT. The database is highly normalized.
I believe that the root problem is the bad row count estimate after certain joins, so my main questions are:
How can I fix the QO's post join rowcount estimate?
Is there a way that I can hint that a join will have a lot of rows without specifying the entire join order manually?
Although the statistics were up to date, the sampling percentage wasn't high enough to provide accurate information. I ran this on each of the base tables that was having a problem, to update all the statistics on the table by scanning all the rows rather than just a default percentage.
UPDATE STATISTICS <table> WITH FULLSCAN, ALL
The query still has a lot of loop joins, but the join order is different and it runs in 2-3 seconds.
Can't you prod the QO with a well-placed query hint?