Netezza - What is colocation in the perspective of SQL?

I know that colocation is important for distributed joins in Netezza. At a high level, it has the following definition:
All data for joins are located in the same SPU
I also talked to some Netezza employees in the past and they mentioned that a join is considered colocated if all tables distribute and join on the same columns.
However, I still feel that definition is a bit lacking ... Based on my understanding of 1-phase and 2-phase GROUP BY's, I suspect colocation really operates on the following definition:
A join is considered colocated if the set of columns used in the join condition is a superset of the distribution keys of all participating tables.
Is that a correct definition? I tried searching for a precise definition of colocation in NZ but all I got is a bunch of articles that kinda assume you know the definition already.
Input on this would be appreciated. Thanks!
Edit: Based on ScottMcG's suggestion, I reformulated the definition of colocated join as:
1. It must be a HASH or MERGE SORT JOIN
2. Set of columns in join conditions must be superset of all distribution keys of all participating tables
3. ?
The ? mark for #3 is an ambiguity I need to iron out. According to ScottMcG, the distribution keys of each table must also be joined with each other.
Suppose Tables A, B, C are distributed on text columns A.C1, A.C2, B.C3, B.C4, C.C5, and C.C6 and we have the following join.
SELECT * FROM A
INNER JOIN B "Join1"
ON A.C1=B.C3
INNER JOIN C "Join2"
ON A.C2=B.C4
AND A.C2=C.C6
AND [X]
Now, let us provide a few possible definitions of [X]. Then for which definitions of [X] will Join2 be colocated?
(1) [X] = A.C2 = 5
(2) [X] = A.C2 = B.C1 OR A.C2 = C.C5
(3) [X] = A.C1 IS NULL
(4) [X] = A.NonKeyColumn1 = B.NonKeyColumn2

For Netezza, a join is considered to be colocated when the tables involved in the join do not need to be redistributed or broadcast from the data slices on which they permanently reside in order to perform the join.
This can only happen if:
The set of columns required by the join is a superset of the columns in the distribution key of each table.
Each table participating in the join has the same set of columns as its distribution key.
The join is an equi-join.
These conditions are pretty close to what you propose in your definition, and are necessary to allow, but not sufficient to ensure, a colocated join. It is possible that the optimizer might decide to pre-broadcast one of the tables if it were small enough even though they are distributed on the same columns, and that would then technically not be a colocated join.
One caveat I should add is that for a column to be considered the "same" as another column, the column values should hash to the same value. Generally speaking this means that the column data types would be the same. An exception is that the integer family of datatypes (byteint, smallint, int, bigint) will hash to the same value as long as they are in the supported range.
With regard to the effect of types of joins, equijoins would be of this form. Note that this could either be a hash join or a merge sort join (if the data types were perhaps floating point) under the covers. In either case, we don't need to redistribute the data. In these examples, both tables are distributed on COL1.
SELECT ...
FROM TableA A
JOIN TableB B
ON A.COL1 = B.COL1
If the join is an expression based join like either of the following, then you will end up with a redistribution or broadcast of the data. For the "less than" join, you have to be able to determine that 8 is less than 9, but since they will both be hashed to different data slices, they can only be compared if one is relocated to the other.
SELECT ...
FROM TableA A
JOIN TableB B
ON A.COL1 < B.COL1
SELECT ...
FROM TableA A
JOIN TableB B
ON A.COL1 - B.COL1 = 0
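The difference between the equi-join and the expression-based joins above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the modulo "hash", the slice count, and the helper names (distribute, local_join, global_join) are invented for illustration and say nothing about how Netezza actually hashes rows or plans queries.

```python
from collections import defaultdict

N_SLICES = 4  # hypothetical number of data slices


def distribute(rows, n_slices=N_SLICES):
    """Place each (key, payload) row on a slice chosen by a toy hash of its key."""
    slices = defaultdict(list)
    for key, payload in rows:
        # key % n_slices is a stand-in for the real (unknown here) hash function
        slices[key % n_slices].append((key, payload))
    return slices


def local_join(a_slices, b_slices, predicate, n_slices=N_SLICES):
    """Join only rows that already sit on the same slice (no data movement)."""
    out = set()
    for s in range(n_slices):
        for a in a_slices.get(s, []):
            for b in b_slices.get(s, []):
                if predicate(a[0], b[0]):
                    out.add((a, b))
    return out


def global_join(a_rows, b_rows, predicate):
    """Reference result: compare every row of A with every row of B."""
    return {(a, b) for a in a_rows for b in b_rows if predicate(a[0], b[0])}


table_a = [(k, f"a{k}") for k in range(1, 11)]
table_b = [(k, f"b{k}") for k in range(5, 15)]
a_slices, b_slices = distribute(table_a), distribute(table_b)

equi_local = local_join(a_slices, b_slices, lambda x, y: x == y)
equi_global = global_join(table_a, table_b, lambda x, y: x == y)
lt_local = local_join(a_slices, b_slices, lambda x, y: x < y)
lt_global = global_join(table_a, table_b, lambda x, y: x < y)

print("equi-join complete without moving data:", equi_local == equi_global)
print("less-than join complete without moving data:", lt_local == lt_global)
```

Equal keys always land on the same slice, so the slice-local equi-join finds every match. The pair (8, 9) from the "less than" explanation sits on two different slices in this toy setup, so the slice-local version of that join misses it unless one side is moved.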

Related

Is there an equivalent to OR clause in CONTAINSTABLE - FULL TEXT INDEX

I am trying to find a solution in order to improve the String searching process and I selected FULL-TEXT INDEX Strategy.
However, after implementing it, I still can see there is a performance hit when it comes to search by using multiple strings using multiple Full-Text Index tables with OR clauses.
(E.g. WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%'))
As a solution, I am trying to use CONTAINSTABLE expecting a performance improvement.
Now, I am facing an issue with CONTAINSTABLE when it comes to joining tables with a LEFT JOIN
Please go through the example below.
Query 1
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
LEFT JOIN CONTAINSTABLE(P.Building,*,'%John%') AS FFTIndex ON F.ID = FFTIndex.[Key]
LEFT JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
LEFT JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
LEFT JOIN P.Person p ON pr2.ID = p.PID
LEFT JOIN CONTAINSTABLE(P.Person,FirstName,'%John%') AS PFTIndex ON P.ID = PFTIndex.[Key]
WHERE F.Name IS NOT NULL
This produces the below result.
Query 2
SELECT F.Name,p.*
FROM P.Role PR
INNER JOIN P.Building F ON PR.PID = F.PID
INNER JOIN P.Relationship PRSHIP ON PR.id = prship.ToRoleID
INNER JOIN P.Role PR2 ON PRSHIP.ToRoleID = PR2.ID
INNER JOIN P.Person p ON pr2.ID = p.PID
WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%')
AND F.Name IS NOT NULL
Result
Expectation
I want Query 1 to behave like a SQL Server OR clause. As far as I understand, the first CONTAINSTABLE in Query 1 joins its matches to the Building table and the remaining rows are ignored, so the CONTAINSTABLE on the Person table only sees rows that were already filtered by the keyword in the Building table.
If the keyword is, say, "Building", I want to match it against both tables independently, regardless of whether a record was found in the other table. A matching record in either table is enough.
Summary
Query 2 works, but it becomes slow as the number of words in the indexes grows. Query 1 seems better optimized (according to multiple online resources and the MS documentation);
however, it does not give me the expected output.
Is there any way to solve this problem?
I am not strictly attached to CONTAINSTABLE. Suggestions for another optimization method are also welcome.
Thank you.
Hard to say definitively without your full data set but a couple of options to explore
Remove Invalid % Wildcards
Why are you using '%SearchTerm%'? Does performance improve if you use the search term without the wildcards (%)? If you want a word that matches a prefix, try something like
WHERE CONTAINS (String,'"SearchTerm*"')
Try Temp Tables
My guess is CONTAINS is slightly faster than CONTAINSTABLE as it doesn't calculate a rank, but I don't know if anyone has ever attempted to benchmark it. Either way, I'd try saving off the matches to a temp table before joining up to the rest of the tables. This will allow the optimizer to create a better execution plan
SELECT ID INTO #Temp
FROM YourTable
WHERE CONTAINS (String,'"SearchTerm"')
SELECT *
FROM #Temp
INNER JOIN...
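As a rough, runnable sketch of the "save the matches first, then join" idea, the snippet below uses Python's built-in sqlite3 module. The person and role tables and their contents are hypothetical, and LIKE stands in for CONTAINS because full-text search is not assumed to be available here; it shows the shape of the approach, not SQL Server performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE role (id INTEGER PRIMARY KEY, person_id INTEGER, title TEXT);
INSERT INTO person VALUES (1, 'John'), (2, 'Mary'), (3, 'Johnny');
INSERT INTO role VALUES (10, 1, 'Owner'), (11, 2, 'Tenant'), (12, 3, 'Agent');

-- Step 1: save only the matching keys (LIKE stands in for CONTAINS here).
CREATE TEMP TABLE matches AS
SELECT id FROM person WHERE first_name LIKE 'John%';
""")

# Step 2: join the small key set to the rest of the tables.
rows = conn.execute("""
SELECT p.first_name, r.title
FROM matches m
JOIN person p ON p.id = m.id
JOIN role r ON r.person_id = p.id
ORDER BY p.id
""").fetchall()

print(rows)
```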
Optimize Full Text Index by Removing Noisy Words
You might find you have some noisy words aka words that reoccur many times in your data that are meaningless like "the" or perhaps some business jargon. Adding these to your stop list will mean your full text index will ignore them, making your index smaller thus faster
The query below will list indexed words with the most frequent at the top
Select *
From sys.dm_fts_index_keywords(Db_Id(),Object_Id('dbo.YourTable') /*Replace with your table name*/)
Order By document_count Desc
This OR That Criteria
Your WHERE CONTAINS(F.*,'%Gayan%') OR CONTAINS(P.FirstName,'%John%') criteria, where you want this or that, is tricky. OR clauses generally perform poorly, even when using simple equality operators.
I'd try either doing two queries and union the results like:
SELECT * FROM Table1 F
/*Other joins and stuff*/
WHERE CONTAINS(F.*,'%Gayan%')
UNION
SELECT * FROM Table2 P
/*Other joins and stuff*/
WHERE CONTAINS(P.FirstName,'%John%')
Or, and this is much more work, you could load all your data into a giant denormalized table with all your columns. Then apply a full-text index to that table and adjust your search criteria accordingly. It'd probably be the fastest method for searching, but then you'd have to ensure the data stays in sync between the denormalized table and the underlying normalized tables.
SELECT B.*,P.* INTO DenormalizedTable
FROM Building AS B
INNER JOIN People AS P ON ... /*join condition for your schema*/
CREATE FULLTEXT INDEX ON DenormalizedTable ... /*columns to index and KEY INDEX*/
etc...
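The UNION rewrite suggested earlier in this answer can be sanity-checked on a small scale. The sketch below uses Python's sqlite3 with hypothetical building and person tables, and LIKE as a stand-in for CONTAINS, so it only demonstrates that the two forms return the same set of rows, not how either performs on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE building (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person (id INTEGER PRIMARY KEY, building_id INTEGER, first_name TEXT);
INSERT INTO building VALUES (1, 'Gayan Tower'), (2, 'Main Hall');
INSERT INTO person VALUES (1, 1, 'John'), (2, 1, 'Mary'), (3, 2, 'John'), (4, 2, 'Anne');
""")

# One query with an OR across two tables.
or_query = """
SELECT b.name, p.first_name
FROM building b JOIN person p ON p.building_id = b.id
WHERE b.name LIKE '%Gayan%' OR p.first_name LIKE '%John%'
"""

# Two single-predicate queries combined with UNION (which removes duplicates).
union_query = """
SELECT b.name, p.first_name
FROM building b JOIN person p ON p.building_id = b.id
WHERE b.name LIKE '%Gayan%'
UNION
SELECT b.name, p.first_name
FROM building b JOIN person p ON p.building_id = b.id
WHERE p.first_name LIKE '%John%'
"""

or_rows = sorted(conn.execute(or_query).fetchall())
union_rows = sorted(conn.execute(union_query).fetchall())
print(or_rows == union_rows)
```

Because UNION removes duplicate rows, the two forms are only interchangeable when the selected rows are distinct, as they are in this toy data; include a key column in the select list if duplicates matter.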

Is this join overcomplicated?

I have inherited an application made by a previous developer. Some of the database calls are running slow in places where there is a large amount of data. I have found in general the SQL code is well written but there are places that make me think, 'what the..?'
Here is one example:
select a.*
from bs_ResearchEnquiry a
left join bs_StateWorkflowState_Map b
on (
select c.MapId from bs_StateWorkflowState_Map c
where c.StateId = a.StateId AND c.StateWorkflowId = a.StateWorkflowId
)=b.MapId
where
b.IsFinal=1
The MapId field is a unique primary key to the bs_StateWorkflowState_Map table.
StateId and StateWorkflowId together also form a unique key.
There will always be a match on these keys to rows in the foreign table bs_ResearchEnquiry
Therefore, could I rewrite the left join more efficiently, and safely, as:
inner join bs_StateWorkflowState_Map b
on b.StateId = a.StateId AND b.StateWorkflowId = a.StateWorkflowId
Or was the original developer trying to achieve something I've missed ?
Your simplification looks good to me. Note that the presence of:
where b.IsFinal = 1
means that the outer join is effectively an inner join.
With your explanation on keys given, you are right, the query can be simplified. It selects records from bs_ResearchEnquiry where the associated bs_StateWorkflowState_Map record is final. So use EXISTS:
select *
from bs_ResearchEnquiry re
where exists
(
select *
from bs_StateWorkflowState_Map m
where m.StateId = re.StateId
and m.StateWorkflowId = re.StateWorkflowId
and m.IsFinal = 1
);
(From your explanation on uniqueness, I gather that there already exist indexes on (StateId, StateWorkflowId) in both tables. If not, create them.)
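To check that the rewrites discussed in this thread agree with the original query, here is a small sketch using Python's sqlite3. The Id column on bs_ResearchEnquiry and all sample rows are assumptions added for illustration, and SQLite is only a stand-in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bs_StateWorkflowState_Map (
    MapId INTEGER PRIMARY KEY,
    StateId INTEGER, StateWorkflowId INTEGER, IsFinal INTEGER,
    UNIQUE (StateId, StateWorkflowId)
);
CREATE TABLE bs_ResearchEnquiry (
    Id INTEGER PRIMARY KEY,  -- hypothetical key, used only to compare results
    StateId INTEGER, StateWorkflowId INTEGER
);
INSERT INTO bs_StateWorkflowState_Map VALUES (1,10,100,1), (2,11,100,0), (3,10,101,1);
INSERT INTO bs_ResearchEnquiry VALUES (1,10,100), (2,11,100), (3,10,101), (4,10,100);
""")

queries = {
    # The inherited form: left join on a correlated scalar subquery.
    "original": """
        SELECT a.Id FROM bs_ResearchEnquiry a
        LEFT JOIN bs_StateWorkflowState_Map b ON (
            SELECT c.MapId FROM bs_StateWorkflowState_Map c
            WHERE c.StateId = a.StateId AND c.StateWorkflowId = a.StateWorkflowId
        ) = b.MapId
        WHERE b.IsFinal = 1 ORDER BY a.Id""",
    # The proposed simplification.
    "inner_join": """
        SELECT a.Id FROM bs_ResearchEnquiry a
        INNER JOIN bs_StateWorkflowState_Map b
            ON b.StateId = a.StateId AND b.StateWorkflowId = a.StateWorkflowId
        WHERE b.IsFinal = 1 ORDER BY a.Id""",
    # The EXISTS form from the answer.
    "exists": """
        SELECT re.Id FROM bs_ResearchEnquiry re
        WHERE EXISTS (
            SELECT * FROM bs_StateWorkflowState_Map m
            WHERE m.StateId = re.StateId
              AND m.StateWorkflowId = re.StateWorkflowId
              AND m.IsFinal = 1
        ) ORDER BY re.Id""",
}

results = {name: [row[0] for row in conn.execute(sql)] for name, sql in queries.items()}
print(results)
```

All three forms return the enquiries whose mapped state is final. The equivalence relies on the uniqueness of (StateId, StateWorkflowId) described in the question; without it the join forms could return duplicate rows where EXISTS would not.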

Database Join Performance Comparison [duplicate]

This question already has answers here:
Explicit vs implicit SQL joins
(12 answers)
Closed 7 years ago.
I am on a project where many of the queries are performed by including multiple tables in the FROM clause. I know this is legal, but I have always used explicit JOINs instead.
For example, two tables (using SQL Server DDL)
CREATE TABLE Manufacturers(
ManufacturerID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
Name varchar(100))
CREATE TABLE Cars (
ModelID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
ManufacturerID INT CONSTRAINT FK_Manufacturer FOREIGN KEY REFERENCES Manufacturers(ManufacturerID),
ModelName VARCHAR(100))
If I want to find the models for GM, I could do either:
SELECT ModelName FROM Cars c, Manufacturers m WHERE c.ManufacturerID=m.ManufacturerID AND m.Name='General Motors'
or
SELECT ModelName FROM Cars c INNER JOIN Manufacturers m ON c.ManufacturerID=m.ManufacturerID WHERE m.Name='General Motors'
My question is this: does one form perform better than the other? Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server? What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?
My question is this: does one form perform better than the other?
They should not. You can check your execution plans to be certain, but every RDBMS I've worked with generates the same plans for comma (ANSI-89) joins as it does for ANSI-92 explicit joins. (Note that comma joins didn't stop being ANSI SQL, it's just that ANSI-92 is where the explicit syntax first appeared.)
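The "check your execution plans" advice can be tried on a small scale. The sketch below uses Python's sqlite3 and SQLite's EXPLAIN QUERY PLAN as a stand-in; it says nothing definitive about Oracle or SQL Server, where you would use their own plan tools. The sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Manufacturers (ManufacturerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Cars (
    ModelID INTEGER PRIMARY KEY,
    ManufacturerID INTEGER REFERENCES Manufacturers(ManufacturerID),
    ModelName TEXT
);
INSERT INTO Manufacturers VALUES (1, 'General Motors'), (2, 'Ford');
INSERT INTO Cars VALUES (1, 1, 'Impala'), (2, 1, 'Malibu'), (3, 2, 'Focus');
""")

comma_join = """
SELECT ModelName FROM Cars c, Manufacturers m
WHERE c.ManufacturerID = m.ManufacturerID AND m.Name = 'General Motors'
"""
explicit_join = """
SELECT ModelName FROM Cars c
INNER JOIN Manufacturers m ON c.ManufacturerID = m.ManufacturerID
WHERE m.Name = 'General Motors'
"""


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


comma_rows = sorted(conn.execute(comma_join).fetchall())
explicit_rows = sorted(conn.execute(explicit_join).fetchall())
comma_plan, explicit_plan = plan(comma_join), plan(explicit_join)

print("same rows:", comma_rows == explicit_rows)
print("same plan:", comma_plan == explicit_plan)
```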
Aside from how the tables are defined in Oracle vs SQL Server, does one form of join work better than the other in Oracle or SQL Server?
As far as the server is concerned, no.
What if you include more tables, say 3 or 4? Does that change the performance characteristics, assuming the queries are constructed to return an equivalent record set?
It's possible. With comma joins, I'm not sure it's possible to control the JOIN order with parentheses like you can with explicit joins:
SELECT *
FROM Table1 t1
INNER JOIN (
Table2 t2
INNER JOIN Table3 t3
ON t2.id = t3.id)
ON t1.id = t2.id
This can affect overall query performance (for better or worse). I'm not sure how to accomplish the same level of control with comma joins, but I've never fully learned comma join syntax. I don't know if you can say Table1 t1, (Table2 t2, Table3 t3), but I don't believe you can. I think you'd have to use subqueries to do that.
The primary advantages of explicit joins are:
Easier to read. It makes it very clear which conditions are used with which join. You won't ever see Table1 t1, Table2 t2, Table3 t3 and then have to dig into the WHERE clause to figure out if one of those joins is an outer join. It also means the WHERE clause isn't stuffed full of all these join conditions you know you don't care about changing when you modify a query.
Easier to use outer joins. In the case of outer joins, you can even specify literal filter values in the outer table without having to handle nulls from the outer join.
Easier to reuse existing joins. If you just want to query from the same relations, you can just grab the FROM clause. You don't have to worry about what bits from the WHERE clause you want and what bits you don't.
Identical syntax across RDBMSs. When you spend all day switching between Oracle and SQL Server, the last thing you want to worry about is confusing + and *= to get your outer joins right.
All of the above make the explicit join syntax more maintainable, which is a very important factor for software.

SQL FROM clause using n>1 tables

If you add more than one table to the FROM clause (in a query), how does this impact the result set? Does it first select from the first table then from the second and then create a union (i.e., only the rowspace is impacted?) or does it actually do something like a join (i.e., extend the column space)? And when you use multiple tables in the FROM clause, does the WHERE clause filter both sub-result-sets?
Specifying two tables in your FROM clause will execute a JOIN. You can then use the WHERE clause to specify your JOIN conditions. If you fail to do this, you will end up with a Cartesian product (every row in the first table indiscriminately joined to every row in the second).
The code will look something like this:
SELECT a.*, b.*
FROM table1 a, table2 b
WHERE a.id = b.id
However, I always try to explicitly specify my JOINs (with JOIN and ON keywords). That makes it abundantly clear (for the next developer) as to what you're trying to do. Here's the same JOIN, but explicitly specified:
SELECT a.*, b.*
FROM table1 a
INNER JOIN table2 b ON b.id = a.id
Note that now I don't need a WHERE clause. This method also helps you avoid generating an inadvertent Cartesian product (if you happen to forget your WHERE clause), because the ON is specified explicitly.
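To make the three cases above concrete (no WHERE clause, comma join with WHERE, explicit JOIN), here is a minimal sketch using Python's sqlite3. The label column and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, label TEXT);
CREATE TABLE table2 (id INTEGER, label TEXT);
INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO table2 VALUES (1, 'b1'), (2, 'b2');
""")

# No join condition: every row of table1 paired with every row of table2.
cartesian = conn.execute(
    "SELECT a.*, b.* FROM table1 a, table2 b"
).fetchall()

# Comma join with the condition in WHERE.
filtered = conn.execute(
    "SELECT a.*, b.* FROM table1 a, table2 b WHERE a.id = b.id ORDER BY a.id"
).fetchall()

# The same join written explicitly.
explicit = conn.execute(
    "SELECT a.*, b.* FROM table1 a INNER JOIN table2 b ON b.id = a.id ORDER BY a.id"
).fetchall()

print(len(cartesian), "rows without a join condition")
print(filtered == explicit)
```

Each result row carries the columns of both tables, so a multi-table FROM clause extends the column space rather than stacking rows the way a union would, and the WHERE clause filters the combined rows.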

WHERE clause better execute before IN and JOIN or after

I read this article:
Logical Processing Order of the SELECT statement
At the end of the article it says that the ON and JOIN clauses are considered before WHERE.
Consider a master table with 10 million records and a detail table (with a foreign key referencing the master table) with 50 million records. We have a query that wants just 100 records from the detail table, selected by a PK in the master table.
In this situation, do ON and JOIN execute before WHERE? I mean, do we have 500 million records after the JOIN, with WHERE applied to them afterwards? Or is WHERE applied first, and then JOIN and ON? If the second answer is true, does that contradict the article above?
Thanks.
In the case of an INNER JOIN, or of the table on the left in a LEFT JOIN, the optimizer will in many cases find that it is better to perform any filtering first (highest selectivity) before actually performing whatever type of physical join it chooses, so some physical orders of operations are obviously better than others.
To some extent you can sometimes control this (or interfere with this) with your SQL, for instance, with aggregates in subqueries.
The logical order of processing the constraints in the query can only be transformed according to known invariant transformations.
So:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is still logically equivalent to:
SELECT *
FROM a
INNER JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and they will generally have the same execution plan.
On the other hand:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
WHERE a.something = something
AND b.something = something
is NOT equivalent to:
SELECT *
FROM a
LEFT JOIN b
ON a.id = b.id
AND a.something = something
AND b.something = something
and so the optimizer isn't going to transform them into the same execution plan.
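The difference between the INNER JOIN and LEFT JOIN cases above can be reproduced with a few rows. This is a minimal sketch using Python's sqlite3; the sample data and the run helper are made up for illustration, and 'x' plays the role of the "something" value being filtered on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, something TEXT);
CREATE TABLE b (id INTEGER, something TEXT);
INSERT INTO a VALUES (1, 'x'), (2, 'y');
INSERT INTO b VALUES (1, 'x'), (2, 'x');
""")


def run(join_type, filters_in):
    """Run the query with the filters placed either in WHERE or in ON."""
    glue = "WHERE" if filters_in == "where" else "AND"
    sql = f"""
        SELECT a.id, b.id FROM a
        {join_type} JOIN b ON a.id = b.id
        {glue} a.something = 'x' AND b.something = 'x'
        ORDER BY a.id
    """
    return conn.execute(sql).fetchall()


inner_where, inner_on = run("INNER", "where"), run("INNER", "on")
left_where, left_on = run("LEFT", "where"), run("LEFT", "on")

print("INNER JOIN forms agree:", inner_where == inner_on)
print("LEFT JOIN forms agree:", left_where == left_on)
```

With the filters in ON, the LEFT JOIN still returns the row of a that fails them, padded with NULL for b. With the filters in WHERE, that row is removed after the join.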
The optimizer is very smart and is able to move things around pretty successfully, including collapsing views and inline table-valued functions as well as even pushing things down through certain kinds of aggregates fairly successfully.
Typically, when you write SQL, it needs to be understandable, maintainable and correct. As far as efficiency in execution, if the optimizer is having difficulty turning the declarative SQL into an execution plan with acceptable performance, the code can sometimes be simplified or appropriate indexes or hints added or broken down into steps which should perform more quickly - all in successive orders of invasiveness.
It doesn't matter
Logical processing order is always honoured, regardless of actual processing order.
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says
Just re-reading Paul White's excellent series on the Query Optimiser and remembered this question.
It is possible to use an undocumented command to disable specific transformation rules and get some insight into the transformations applied.
For (hopefully!) obvious reasons only try this on a development instance and remember to re-enable them and remove any suboptimal plans from the cache.
USE AdventureWorks2008;
/*Disable the rules*/
DBCC RULEOFF ('SELonJN');
DBCC RULEOFF ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
You can see with those two rules disabled it does a cartesian join and filter after.
/*Re-enable them*/
DBCC RULEON ('SELonJN');
DBCC RULEON ('BuildSpool');
SELECT P.ProductNumber,
P.ProductID,
I.Quantity
FROM Production.Product P
JOIN Production.ProductInventory I
ON I.ProductID = P.ProductID
WHERE I.ProductID < 3
OPTION (RECOMPILE)
With them enabled the predicate is pushed right down into the index seek and so reduces the number of rows processed by the join operation.
There is no defined order. The SQL engine determines what order to perform the operations based on the execution strategy chosen by its optimizer.
I think you have misread ON as IN in the article.
However, the order shown in the article is correct (it is MSDN, after all). The ON and JOIN are naturally executed before WHERE, because WHERE has to be applied as a filter on the temporary result set produced by the JOINs.
The article only describes the logical order of execution, and at the end of the paragraph it also adds this line ;)
"Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list."
