Merge Join over sorted columns instead of Hash Join - sql-server

I have two tables
Table A
(
id int,
name varchar(39),
lname varchar (49),
...
)
Table B
(
id int,
city varchar(39),
...
)
Both tables are sorted on column ID. IDs are simply identities and are populated by auto incremented integers 1 to n.
However, if I run a query such as
SELECT *
FROM A, B
WHERE A.id = B.id;
I get a hash join instead of the efficient merge join. How can I enforce the merge join in SQL Server instead? I don't want to use an index, thus no index-based plans.
Note that I don't want a merge join with a sort enforcer either. I know that one can hint the planner by rewriting the query to
SELECT *
FROM A
INNER MERGE JOIN B ON A.ID = B.ID;
By the way, I'm using SQL Server Express edition, but I can switch to any open source DB if it supports the query plan that I'm aiming for.
Thanks in advance

If you believe you are smarter than the SQL engine :-) you can use hints like this:
SELECT *
FROM A
INNER HASH JOIN B
ON A.id = B.id
OR
SELECT *
FROM A
INNER MERGE JOIN B
ON A.id = B.id
At least this way you can test whether the MERGE join is really better. And even if it is better in this case, that does not mean it will always be the best choice. It can reduce performance in other cases, so it is generally better to leave this decision to the engine.
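To make the trade-off concrete, here is a minimal sketch of what a merge join does, written in Python rather than T-SQL. It assumes both inputs are already sorted by id and that ids are unique, as with the identity columns in the question. The names merge_join, table_a and table_b are illustrative only, and this is the general idea rather than how SQL Server implements the operator.

```python
def merge_join(left, right):
    """Join two lists of (id, value) tuples that are ALREADY sorted by id.

    Assumption: ids are unique within each input (identity-style keys).
    """
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_id, left_value = left[i]
        right_id, right_value = right[j]
        if left_id == right_id:
            # Matching keys: emit one joined row and advance both sides.
            joined.append((left_id, left_value, right_value))
            i += 1
            j += 1
        elif left_id < right_id:
            # The left key cannot match anything later on the right.
            i += 1
        else:
            j += 1
    return joined


# Hypothetical sample data, sorted by id on both sides.
table_a = [(1, "Ann"), (2, "Bob"), (4, "Cid")]
table_b = [(1, "Oslo"), (3, "Rome"), (4, "Lima")]
joined = merge_join(table_a, table_b)
print(joined)
```

The single forward pass is only correct because of the ordering. An engine that has no guarantee about the order of its inputs (for example, no ordered index on id) has to add a sort first or fall back to something like a hash join.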

Related

How to limit an inner join in Sybase?

How can I limit an inner join or a subquery so that it only selects one row? It seems I can't use 'top 1' in subqueries in my Sybase version (Adaptive Server Enterprise/15.5/EBF 19902).
Example
select * from A a
inner join B b on a.id = b.Aid
where table B has two records linked to table A (same Aid). But I'd like to join only one of these records.
I tried to replace the inner join with a subquery and using top 1, but this is not allowed.
I found a solution here: https://www.periscopedata.com/blog/4-ways-to-join-only-the-first-row-in-sql.html
select * from A a
inner join (select * from B b where b.id in (select min(id) from B group by Aid) ) -- assumes B has a unique id column; min(Aid) grouped by Aid would keep every row
as b on b.Aid = a.id
Came across this post while going from "Sybase ASE doesn't support ROW_NUMBER()" to "TOP is not allowed in subqueries", to how tf do Sybase engineers expect us to limit a subquery result to 1 record? All solutions I've seen rely on min/max, but I haven't seen anything supporting an "ORDER BY" type of sorting.
So a simple
ROW_NUMBER() OVER (ORDER BY t.SEQUENCE, t.FROMDATE, t.FROMTIME)
becomes
select MYVALUE
from (SELECT someExpression as MYVALUE,
RIGHT(CONVERT(VARCHAR,1000000+t.SEQUENCE), 6) || CONVERT(VARCHAR,t.FROMDATE, 23) || CONVERT(VARCHAR,t.FROMTIME) as SORTKEY
FROM MY_TABLE t)
having SORTKEY = MIN(SORTKEY)
which is rather ugly, using all sorts of hacks to support string-sorting of the ORDER BY fields. As this will be used in subqueries, table alias scoping will mean that table joins need to be replicated.
The only alternative I can think of is a cursor with a break-condition so only the first row is processed, but that'll slow down things considerably.
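For the simpler case from the question (keep exactly one B row per A row), the min-based approach can be checked on a small data set. The sketch below uses Python's sqlite3 module purely as a runnable stand-in, not Sybase ASE, and it assumes B has a unique id column, which the question does not state. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT);
    -- Assumption: B has its own unique id in addition to the Aid link.
    CREATE TABLE B (id INTEGER PRIMARY KEY, Aid INTEGER, city TEXT);
    INSERT INTO A VALUES (1, 'first'), (2, 'second');
    INSERT INTO B VALUES (10, 1, 'x'), (11, 1, 'y'), (12, 2, 'z');
""")

# Keep only the B row with the smallest id for each Aid, then join.
rows = conn.execute("""
    SELECT a.id, b.id
    FROM A a
    INNER JOIN (
        SELECT * FROM B WHERE id IN (SELECT MIN(id) FROM B GROUP BY Aid)
    ) AS b ON b.Aid = a.id
    ORDER BY a.id
""").fetchall()
print(rows)
```

Picking the minimum of a unique column is what makes the subquery return a single row per Aid. Taking the minimum of the grouping column itself would not filter anything.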

Too many parameter values slowing down query

I have a query that runs fairly fast under normal circumstances. But it is running very slow (at least 20 minutes in SSMS) due to how many values are in the filter.
Here's the generic version of it, and you can see that one part is filtering by over 8,000 values, making it run slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN @Start AND @End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need a separate index for the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
As gotqn wrote before, if you put your 8000 items into a temp table and inner join it, the query will be even faster, but without an index on the other side of the join it will still be slow.
What you need is to put the filtering values in a temporary table. Then use that table to apply the filtering with an INNER JOIN instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN;
DROP TABLE #FilterDataSource;
END;
CREATE TABLE #FilterDataSource
(
[ID] INT PRIMARY KEY
);
-- you need to split the values into rows and insert them here
INSERT INTO #FilterDataSource ([ID])
...;
SELECT DISTINCT column
FROM table_a a
INNER JOIN table_b b
ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
ON b.id = FS.ID
WHERE a.date BETWEEN @Start AND @End
AND b.place IN ( ... 20 values)
ORDER BY a.column ASC;
A few important notes:
we use a temporary table so that parallel execution plans can be used
if you have a fast function for splitting (for example a CLR function), you can join to the function itself
it is not good to use IN with many values; SQL Server is not always able to build the execution plan, which may lead to timeouts or internal errors - you can find more information here
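The same pattern can be tried end to end on a toy data set. The sketch below uses Python's sqlite3 module as a runnable stand-in for SQL Server, so it only shows the shape of the technique and says nothing about timings. The table filter_ids, the list wanted_ids and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, place TEXT);
    INSERT INTO table_b VALUES
        (1, 'north'), (2, 'south'), (3, 'north'), (4, 'east');
    -- Temporary table that replaces the long IN (...) list.
    CREATE TEMP TABLE filter_ids (id INTEGER PRIMARY KEY);
""")

# In the real case this list would hold the 8,000+ IDs.
wanted_ids = [1, 3, 4, 99]
conn.executemany(
    "INSERT INTO filter_ids (id) VALUES (?)",
    [(value,) for value in wanted_ids],
)

# Filter by joining to the temp table; the short IN list stays as it was.
rows = conn.execute("""
    SELECT b.id, b.place
    FROM table_b b
    INNER JOIN filter_ids f ON f.id = b.id
    WHERE b.place IN ('north', 'east')
    ORDER BY b.id
""").fetchall()
print(rows)
```

Loading the values with a parameterized bulk insert also avoids building one very large SQL string on the client.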

Shortcut for adding table to column name SQL-server 2014

Stupidly simple question, but I just don't know what to google!
If I create a query like this:
Select id, data
from table1
Now I want to join with table2. I can immediately see that the id column is no longer unique and I have to change it to
table1.id
Is there any smart way (like a keyboard shortcut) to do this, instead of manually adding table1 to every column? Either before I add the join, to ensure that all columns will be unique, or afterwards, with suggestions based on the different possible tables.
No, there is no helper.
But note that you can alias the table name:
select x.Col1, y.Col2
from ALongTableName x
inner join AReallyReallyLongTableName y on x.Id = y.OtherId
which can also make queries clearer, and is very much necessary when doing self joins.
First of all, you should start using aliases:
SQL aliases are used to give a database table, or a column in a table,
a temporary name.
Basically aliases are created to make column names more readable.
This will narrow down your problem and make your code maintenance easier. If that's not enough, I guess you could start using auto-completion tools, such as these:
SQL Complete
SQL Prompt
ApexSQL Complete
These have your desired functionality, however, they do not always work as expected (at least for me).
Oh! You can use a table alias. Like this:
SELECT A.ID, A.data
FROM TableA A
INNER JOIN TableB B
ON A.ID = B.ID
You only need the A. or B. prefix when both tables have a column with the selected name. If the name exists in only one table, you don't need it. Like this:
SELECT A.ID, data -- works if TableB has no column named data
FROM TableA A
INNER JOIN TableB B
ON A.ID = B.ID
Or:
Select A.*, B.ID
FROM TableA A
INNER JOIN TableB B
ON A.ID = B.ID
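To see when the prefix is actually required, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server. The exact error text differs between engines, and the tables and the aliases t1 and t2 are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, data TEXT);
    CREATE TABLE table2 (id INTEGER, extra TEXT);
    INSERT INTO table1 VALUES (1, 'd1');
    INSERT INTO table2 VALUES (1, 'e1');
""")

# Unqualified id exists in both tables, so the engine rejects the query.
try:
    conn.execute(
        "SELECT id, data FROM table1 t1 "
        "INNER JOIN table2 t2 ON t1.id = t2.id"
    )
    ambiguous = False
except sqlite3.OperationalError as exc:
    ambiguous = "ambiguous" in str(exc)

# Qualifying only the shared column is enough; data and extra are unique.
rows = conn.execute(
    "SELECT t1.id, data, extra FROM table1 t1 "
    "INNER JOIN table2 t2 ON t1.id = t2.id"
).fetchall()
print(ambiguous, rows)
```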

how to improve performance when rewrite join SQL?

Suppose I have 2 tables that need to be joined. There are 2 ways to write the SQL:
select * from taba a join tabb b on a.id =b.id where ...
select * from taba a, tabb b where a.id = b.id and ...
Which one performs better, or is this only a syntax difference between SQL standards that has no effect on performance?
This has already been answered here:
stackoverflow.com/questions/1129923/is-a-join-faster-than-a-where
Query optimizers usually handle an explicit join at least as well as a WHERE clause (so in theory the join is the better choice), but the db engine you're using has the last word.
The best advice is to try both.
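One way to "try" is to run both forms and compare the results and the plans the engine reports. The sketch below does this with Python's sqlite3 module and its EXPLAIN QUERY PLAN statement. This is an assumption-laden stand-in: other engines have their own plan tools and may behave differently, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taba (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tabb (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO taba VALUES (1, 'n1'), (2, 'n2'), (3, 'n3');
    INSERT INTO tabb VALUES (2, 'c2'), (3, 'c3'), (4, 'c4');
""")

explicit = ("SELECT a.id, a.name, b.city FROM taba a "
            "JOIN tabb b ON a.id = b.id ORDER BY a.id")
implicit = ("SELECT a.id, a.name, b.city FROM taba a, tabb b "
            "WHERE a.id = b.id ORDER BY a.id")

explicit_rows = conn.execute(explicit).fetchall()
implicit_rows = conn.execute(implicit).fetchall()

# The last column of each EXPLAIN QUERY PLAN row is the readable step.
explicit_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + explicit)]
implicit_plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + implicit)]

print("same rows:", explicit_rows == implicit_rows)
print("same plan:", explicit_plan == implicit_plan)
```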

SQL FROM clause using n>1 tables

If you add more than one table to the FROM clause (in a query), how does this impact the result set? Does it first select from the first table then from the second and then create a union (i.e., only the rowspace is impacted?) or does it actually do something like a join (i.e., extend the column space)? And when you use multiple tables in the FROM clause, does the WHERE clause filter both sub-result-sets?
Specifying two tables in your FROM clause will execute a JOIN. You can then use the WHERE clause to specify your JOIN conditions. If you fail to do this, you will end up with a Cartesian product (every row in the first table indiscriminately joined to every row in the second).
The code will look something like this:
SELECT a.*, b.*
FROM table1 a, table2 b
WHERE a.id = b.id
However, I always try to explicitly specify my JOINs (with JOIN and ON keywords). That makes it abundantly clear (for the next developer) as to what you're trying to do. Here's the same JOIN, but explicitly specified:
SELECT a.*, b.*
FROM table1 a
INNER JOIN table2 b ON b.id = a.id
Note that now I don't need a WHERE clause. This method also helps you avoid generating an inadvertent Cartesian product (if you happen to forget your WHERE clause), because the ON is specified explicitly.
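The row-space versus column-space part of the question can be checked by counting. Here is a minimal sketch with Python's sqlite3 module and made-up sample rows (3 rows in table1, 2 rows in table2). The column names label and tag are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, label TEXT);
    CREATE TABLE table2 (id INTEGER, tag TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2, 'x'), (3, 'y');
""")

# No WHERE clause: every row of table1 is paired with every row of table2.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM table1 a, table2 b"
).fetchone()[0]

# With the join condition in WHERE, only matching pairs remain.
joined_count = conn.execute(
    "SELECT COUNT(*) FROM table1 a, table2 b WHERE a.id = b.id"
).fetchone()[0]

# The result has the columns of both tables side by side, not a union.
column_count = len(
    conn.execute("SELECT * FROM table1 a, table2 b").description
)
print(cross_count, joined_count, column_count)
```

So the column space is extended (2 + 2 columns), and without a condition the row count is the product of the two table sizes.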
