Is recursion good in SQL Server? - sql-server

I have a table in SQL server that has the normal tree structure of Item_ID, Item_ParentID.
Suppose I want to iterate and get all CHILDREN of a particular Item_ID (at any level).
Recursion seems an intuitive candidate for this problem and I can write an SQL Server function to do this.
Will this affect performance if my table has many many records?
How do I avoid recursion and simply query the table? Please any suggestions?

With SQL Server 2005 you can use the WITH keyword (a recursive common table expression).
Check out this question and particularly this answer.
With Oracle you could use CONNECT BY keyword to generate hierarchical queries (syntax).
AFAIK with MySQL you'll have to use the recursion.
Alternatively you could always build a cache table for your records parent->child relationships

As a general answer, it is possible to do some pretty sophisticated stuff in SQL Server that normally needs recursion, simply by using an iterative algorithm. I managed to write an XHTML parser in Transact-SQL that worked surprisingly well. The code prettifier I wrote was done in a stored procedure. It ain't elegant, it is rather like watching buffalo doing ballet, but it works.

Are you using SQL 2005?
If so you can use Common Table Expressions for this. Something along these lines:
;with CTE (Some, Columns, ItemId, ParentId) as
(
    select Some, Columns, ItemId, ParentId
    from myTable
    where ItemId = @itemID
    union all
    select a.Some, a.Columns, a.ItemId, a.ParentId
    from myTable as a
    inner join CTE as b on a.ParentId = b.ItemId
    where a.ItemId <> b.ItemId
)
select * from CTE
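For anyone who wants to try the recursive CTE idea without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It is an assumption-laden stand-in, not the T-SQL above: SQLite needs the RECURSIVE keyword, the table holds only ItemId and ParentId, and the sample rows and the `descendants` helper are made up for illustration.

```python
import sqlite3

def descendants(conn, item_id):
    """Return (ItemId, ParentId) rows for item_id and everything below it."""
    sql = """
        WITH RECURSIVE cte(ItemId, ParentId) AS (
            -- anchor: the starting item
            SELECT ItemId, ParentId FROM myTable WHERE ItemId = ?
            UNION ALL
            -- recursive step: rows whose parent is already in the CTE
            SELECT a.ItemId, a.ParentId
            FROM myTable AS a
            JOIN cte AS b ON a.ParentId = b.ItemId
            WHERE a.ItemId <> b.ItemId
        )
        SELECT ItemId, ParentId FROM cte ORDER BY ItemId
    """
    return conn.execute(sql, (item_id,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ItemId INTEGER PRIMARY KEY, ParentId INTEGER)")
# Hypothetical sample tree: 1 -> (2, 3), 2 -> 4, 4 -> 5, and 6 is a separate root
conn.executemany("INSERT INTO myTable VALUES (?, ?)",
                 [(1, None), (2, 1), (3, 1), (4, 2), (5, 4), (6, None)])
subtree = descendants(conn, 2)
```

With this sample data, `subtree` holds item 2 plus its descendants 4 and 5; item 3 is a sibling and is not included.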

The problem you will face with recursion and performance is how many times it will have to recurse to return the results. Each recursive call is another separate call that will have to be joined into the total results.
In SQL 2k5 you can use a common table expression to handle this recursion:
WITH Managers AS
(
--initialization
SELECT EmployeeID, LastName, ReportsTo
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
--recursive execution
SELECT e.employeeID,e.LastName, e.ReportsTo
FROM Employees e INNER JOIN Managers m
ON e.ReportsTo = m.employeeID
)
SELECT * FROM Managers
Another solution is to flatten the hierarchy into another table:
Employee_Managers
ManagerId (PK, FK to Employee table)
EmployeeId (PK, FK to Employee table)
All the parent-child relationships would be stored in this table, so if Manager 1 manages Manager 2, who manages Employee 3, the table would look like:
ManagerId EmployeeId
1 2
1 3
2 3
This allows the hierarchy to be easily queried:
select * from employee_managers em
inner join employee e on e.employeeid = em.employeeid and em.managerid = 42
This returns all employees that have manager 42, directly or indirectly. The upside is greater query performance; the downside is the cost of maintaining the hierarchy table.
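To make the maintenance cost concrete, here is a rough sketch of the flattened table in SQLite via Python. The `add_employee` and `reports_of` helpers and the sample names are hypothetical; the point is only that every insert has to copy the new manager's own ancestor rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee_managers (
        managerid  INTEGER REFERENCES employee(employeeid),
        employeeid INTEGER REFERENCES employee(employeeid),
        PRIMARY KEY (managerid, employeeid)
    );
    INSERT INTO employee VALUES (1, 'Manager 1'), (2, 'Manager 2'), (3, 'Employee 3');
    -- one row per (ancestor, descendant) pair, not just direct reports
    INSERT INTO employee_managers VALUES (1, 2), (1, 3), (2, 3);
""")

def reports_of(conn, manager_id):
    """All direct and indirect reports of manager_id, with no recursion."""
    sql = """
        SELECT e.name
        FROM employee_managers em
        JOIN employee e ON e.employeeid = em.employeeid
        WHERE em.managerid = ?
        ORDER BY e.employeeid
    """
    return [row[0] for row in conn.execute(sql, (manager_id,))]

def add_employee(conn, employee_id, name, manager_id):
    """Insert an employee and link it to its manager and all of that manager's managers."""
    conn.execute("INSERT INTO employee VALUES (?, ?)", (employee_id, name))
    conn.execute("""
        INSERT INTO employee_managers (managerid, employeeid)
        SELECT managerid, ? FROM employee_managers WHERE employeeid = ?
        UNION ALL
        SELECT ?, ?
    """, (employee_id, manager_id, manager_id, employee_id))

add_employee(conn, 4, 'Employee 4', 2)
```

Moving or deleting someone in the middle of the tree is where this gets painful, since every pair that passes through that node has to be rewritten.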

Joe Celko has a book (<- link to Amazon) specifically on tree structures in SQL databases. While you would need recursion for your model and there would definitely be a potential for performance issues there, there are alternative ways to model a tree structure depending on what your specific problem involves which could avoid recursion and give better performance.

Perhaps some more detail is in order.
If you have a master-detail relationship as you describe, then won't a simple JOIN get what you need?
As in:
SELECT
SOME_FIELDS
FROM
MASTER_TABLE MT
,CHILD_TABLE CT
WHERE CT.PARENT_ID = MT.ITEM_ID

You shouldn't need recursion for children - you're only looking at the level directly below (i.e. select * from T where ParentId = @parent) - you only need recursion for all descendants.
In SQL2005 you can get the descendants with:
with AllDescendants (ItemId, ItemText) as (
    select t.ItemId, t.ItemText
    from [TableName] t
    where t.ItemId = @ancestorId
    union all
    select sub.ItemId, sub.ItemText
    from [TableName] sub
    inner join AllDescendants tree
        on tree.ItemId = sub.ParentItemId
)
select * from AllDescendants

You don't need recursion at all....
Note, I changed columns to ItemID and ItemParentID for ease of typing...
DECLARE @intLevel INT
SET @intLevel = 1
-- Items stands in for the real table name; @TargetLevel is assumed to be declared elsewhere
INSERT INTO TempTable(ItemID, ItemParentID, Level)
SELECT ItemID, ItemParentID, @intLevel
FROM Items
WHERE ItemParentID IS NULL
WHILE @intLevel < @TargetLevel
BEGIN
    SET @intLevel = @intLevel + 1
    INSERT INTO TempTable(ItemID, ItemParentID, Level)
    SELECT ItemID, ItemParentID, @intLevel
    FROM Items
    WHERE ItemParentID IN (SELECT ItemID FROM TempTable WHERE Level = @intLevel - 1)
    -- If no rows are inserted then there are no children
    IF @@ROWCOUNT = 0
        BREAK
END
SELECT ItemID FROM TempTable WHERE Level = @TargetLevel
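The same level-by-level loop can be sketched outside T-SQL. The version below uses Python with SQLite, where the `Items` table, its sample rows, and the `items_at_level` function are all assumptions for illustration. It counts the rows at the new level instead of reading @@ROWCOUNT.

```python
import sqlite3

def items_at_level(conn, target_level):
    """Walk the tree one level at a time and return the ItemIDs found at target_level."""
    conn.execute("CREATE TEMP TABLE TempTable (ItemID INTEGER, ItemParentID INTEGER, Level INTEGER)")
    level = 1
    # Level 1: the roots
    conn.execute("""INSERT INTO TempTable
                    SELECT ItemID, ItemParentID, ? FROM Items
                    WHERE ItemParentID IS NULL""", (level,))
    while level < target_level:
        level += 1
        # Next level: children of everything found at the previous level
        conn.execute("""INSERT INTO TempTable
                        SELECT ItemID, ItemParentID, ? FROM Items
                        WHERE ItemParentID IN
                              (SELECT ItemID FROM TempTable WHERE Level = ?)""",
                     (level, level - 1))
        inserted = conn.execute("SELECT COUNT(*) FROM TempTable WHERE Level = ?",
                                (level,)).fetchone()[0]
        if inserted == 0:  # no children at this depth, so stop early
            break
    rows = conn.execute("SELECT ItemID FROM TempTable WHERE Level = ? ORDER BY ItemID",
                        (target_level,)).fetchall()
    conn.execute("DROP TABLE TempTable")
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, ItemParentID INTEGER)")
# Hypothetical sample tree: 1 -> (2, 3), 2 -> 4, 4 -> 5, and 6 is a second root
conn.executemany("INSERT INTO Items VALUES (?, ?)",
                 [(1, None), (2, 1), (3, 1), (4, 2), (5, 4), (6, None)])
```

Each pass is one set-based INSERT, so the number of statements grows with the depth of the tree rather than with the number of rows.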

Related

How to create multiple return subquery? [duplicate]

I have this table for documents (simplified version here):
id rev content
1 1 ...
2 1 ...
1 2 ...
1 3 ...
How do I select one row per id and only the greatest rev?
With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ..]. I'm using MySQL.
Currently I use checks in the while loop to detect and over-write old revs from the resultset. But is this the only method to achieve the result? Isn't there a SQL solution?
At first glance...
All you need is a GROUP BY clause with the MAX aggregate function:
SELECT id, MAX(rev)
FROM YourTable
GROUP BY id
It's never that simple, is it?
I just noticed you need the content column as well.
This is a very common question in SQL: find the whole data for the row with some max value in a column per some group identifier. I heard that a lot during my career. Actually, it was one of the questions I answered in my current job's technical interview.
It is, actually, so common that Stack Overflow community has created a single tag just to deal with questions like that: greatest-n-per-group.
Basically, you have two approaches to solve that problem:
Joining with simple group-identifier, max-value-in-group Sub-query
In this approach, you first find the group-identifier, max-value-in-group (already solved above) in a sub-query. Then you join your table to the sub-query with equality on both group-identifier and max-value-in-group:
SELECT a.id, a.rev, a.contents
FROM YourTable a
INNER JOIN (
SELECT id, MAX(rev) rev
FROM YourTable
GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev
Left Joining with self, tweaking join conditions and filters
In this approach, you left join the table with itself. Equality goes in the group-identifier. Then, 2 smart moves:
1. The second join condition is having left side value less than right value
2. When you do step 1, the row(s) that actually have the max value will have NULL in the right side (it's a LEFT JOIN, remember?). Then, we filter the joined result, showing only the rows where the right side is NULL.
So you end up with:
SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;
Conclusion
Both approaches bring the exact same result.
If you have two rows with max-value-in-group for group-identifier, both rows will be in the result in both approaches.
Both approaches are SQL ANSI compatible, thus, will work with your favorite RDBMS, regardless of its "flavor".
Both approaches are also performance friendly, however your mileage may vary (RDBMS, DB structure, indexes, etc.). So when you pick one approach over the other, benchmark. And make sure you pick the one which makes the most sense to you.
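A quick way to convince yourself that the two approaches agree, ties included, is to run them side by side. This is only a sketch in Python with SQLite; the sample contents ('a' to 'd') are invented because the question shows "..." for that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT);
    INSERT INTO YourTable VALUES (1, 1, 'a'), (2, 1, 'b'), (1, 2, 'c'), (1, 3, 'd');
""")

# Approach 1: join against the per-group maximum
JOIN_ON_MAX = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    INNER JOIN (SELECT id, MAX(rev) AS rev FROM YourTable GROUP BY id) b
            ON a.id = b.id AND a.rev = b.rev
    ORDER BY a.id, a.content
"""

# Approach 2: left self-join, keep rows that have no "greater" partner
LEFT_SELF_JOIN = """
    SELECT a.id, a.rev, a.content
    FROM YourTable a
    LEFT OUTER JOIN YourTable b ON a.id = b.id AND a.rev < b.rev
    WHERE b.id IS NULL
    ORDER BY a.id, a.content
"""

def latest(conn, sql):
    """Run one of the two queries and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()
```

Calling `latest(conn, JOIN_ON_MAX)` and `latest(conn, LEFT_SELF_JOIN)` gives the same rows, and adding a second row with the same id and the same maximum rev makes both queries return both tied rows.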
My preference is to use as little code as possible...
You can do it using IN
try this:
SELECT *
FROM t1 WHERE (id,rev) IN
( SELECT id, MAX(rev)
FROM t1
GROUP BY id
)
to my mind it is less complicated... easier to read and maintain.
I am flabbergasted that no answer offered SQL window function solution:
SELECT a.id, a.rev, a.contents
FROM (SELECT id, rev, contents,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) ranked_order
FROM YourTable) a
WHERE a.ranked_order = 1
Window (or windowing) functions were added in ANSI/ISO Standard SQL:2003 and extended in SQL:2008, and are now available from all major vendors. There are more ranking functions available to deal with ties: RANK, DENSE_RANK, PERCENT_RANK.
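The difference between ROW_NUMBER and RANK only shows up when there is a tie. The sketch below runs in Python with SQLite, which is an assumption on my part (it needs SQLite 3.25 or newer for window functions); the tied sample rows and the `top_per_id` helper are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (id INTEGER, rev INTEGER, content TEXT);
    -- id 1 has two rows tied at rev 3
    INSERT INTO YourTable VALUES (1, 1, 'a'), (1, 3, 'd'), (1, 3, 'e'), (2, 1, 'b');
""")

def top_per_id(conn, ranking_function):
    """Return the rows ranked first within each id by the given window function."""
    # The function name is interpolated into the SQL, so only accept known names
    if ranking_function not in ("ROW_NUMBER", "RANK", "DENSE_RANK"):
        raise ValueError("unsupported ranking function")
    sql = f"""
        SELECT id, rev, content
        FROM (SELECT id, rev, content,
                     {ranking_function}() OVER (PARTITION BY id ORDER BY rev DESC) AS ranked_order
              FROM YourTable) a
        WHERE a.ranked_order = 1
        ORDER BY id, content
    """
    return conn.execute(sql).fetchall()
```

ROW_NUMBER always returns exactly one row per id (which of the tied rows is unspecified unless you add a tiebreaker to the ORDER BY), while RANK and DENSE_RANK return every tied row.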
Yet another solution is to use a correlated subquery:
select yt.id, yt.rev, yt.contents
from YourTable yt
where rev =
(select max(rev) from YourTable st where yt.id=st.id)
Having an index on (id,rev) renders the subquery almost as a simple lookup...
Following are comparisons to the solutions in @AdrianCarneiro's answer (subquery, leftjoin), based on MySQL measurements with an InnoDB table of ~1 million records, group size being 1-3.
While for full table scans subquery/leftjoin/correlated timings relate to each other as 6/8/9, when it comes to direct lookups or batch (id in (1,2,3)), subquery is much slower than the others (due to rerunning the subquery). However, I couldn't differentiate between leftjoin and correlated solutions in speed.
One final note: as leftjoin creates n*(n+1)/2 joins in groups, its performance can be heavily affected by the size of groups...
I can't vouch for the performance, but here's a trick inspired by the limitations of Microsoft Excel. It has some good features
GOOD STUFF
It should force return of only one "max record" even if there is a tie (sometimes useful)
It doesn't require a join
APPROACH
It is a little bit ugly and requires that you know something about the range of valid values of the rev column. Let us assume that we know the rev column is a number between 0.00 and 999 including decimals but that there will only ever be two digits to the right of the decimal point (e.g. 34.17 would be a valid value).
The gist of the thing is that you create a single synthetic column by string concatenating/packing the primary comparison field along with the data you want. In this way, you can force SQL's MAX() aggregate function to return all of the data (because it has been packed into a single column). Then you have to unpack the data.
Here's how it looks with the above example, written in SQL
SELECT id,
    CAST(SUBSTRING(max(packed_col) FROM 2 FOR 6) AS float) as max_rev,
    -- the packed number is 8 characters and the separator is 3, so content starts at position 12
    SUBSTRING(max(packed_col) FROM 12) AS content_for_max_rev
FROM (SELECT id,
        CAST(1000 + rev + .001 as CHAR) || '---' || CAST(content AS char) AS packed_col
    FROM yourtable
) AS packed
GROUP BY id
The packing begins by forcing the rev column to be a number of known character length regardless of the value of rev so that for example
3.2 becomes 1003.201
57 becomes 1057.001
923.88 becomes 1923.881
If you do it right, string comparison of two numbers should yield the same "max" as numeric comparison of the two numbers and it's easy to convert back to the original number using the substring function (which is available in one form or another pretty much everywhere).
Unique Identifiers? Yes! Unique identifiers!
One of the best ways to develop a MySQL DB is to have each id AUTOINCREMENT (Source MySQL.com). This allows a variety of advantages, too many to cover here. The problem with the question is that its example has duplicate ids. This disregards these tremendous advantages of unique identifiers, and at the same time, is confusing to those familiar with this already.
Cleanest Solution
DB Fiddle
Newer versions of MySQL come with ONLY_FULL_GROUP_BY enabled by default, and many of the solutions here will fail in testing with this condition.
Even so, we can simply select DISTINCT someuniquefield, MAX( whateverotherfieldtoselect ), ( *somethirdfield ), etc., and have no worries understanding the result or how the query works :
SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content)
FROM Table1 AS t1
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (
SELECT MAX(rev) FROM Table1 t3 WHERE t3.id = t1.id
)
GROUP BY t1.id;
SELECT DISTINCT t1.id, MAX(t1.rev), MAX(t2.content) : Return the DISTINCT id, the MAX() rev, and the content. The last MAX() is redundant, because I know it's just one row, but it's required by the query.
FROM Table1 AS t1 : Table searched on.
JOIN Table1 AS t2 ON t2.id = t1.id AND t2.rev = (...) : Join the table to itself, because we need the content that belongs to MAX(t1.rev).
GROUP BY t1.id : Force one row per id to be the returned result.
Note that since "content" was "..." in OP's question, there's no way to test that this works. So, I changed that to "..a", "..b", so, we can actually now see that the results are correct:
id max(Table1.rev) max(Table2.content)
1 3 ..d
2 1 ..b
Why is it clean? DISTINCT(), MAX(), etc., all make wonderful use of MySQL indices. This will be faster. Or, it will be much faster, if you have indexing, and you compare it to a query that looks at all rows.
Original Solution
With ONLY_FULL_GROUP_BY disabled, we can still use GROUP BY, but then we are only using it on the Salary, and not the id:
SELECT *
FROM
(SELECT *
FROM Employee
ORDER BY Salary DESC)
AS employeesub
GROUP BY employeesub.Salary;
SELECT * : Return all fields.
FROM Employee : Table searched on.
(SELECT *...) subquery : Return all people, sorted by Salary.
GROUP BY employeesub.Salary: Force the top-sorted, Salary row of each employee to be the returned result.
Unique-Row Solution
Note the Definition of a Relational Database: "Each row in a table has its own unique key." This would mean that, in the question's example, id would have to be unique, and in that case, we can just do :
SELECT *
FROM Employee
WHERE Employee.id = 12345
ORDER BY Employee.Salary DESC
LIMIT 1
Hopefully this is a solution that solves the problem and helps everyone better understand what's happening in the DB.
Another way to do the job is to use the MAX() analytic function with an OVER (PARTITION BY ...) clause:
SELECT t.*
FROM
(
SELECT id
,rev
,contents
,MAX(rev) OVER (PARTITION BY id) as max_rev
FROM YourTable
) t
WHERE t.rev = t.max_rev
The other ROW_NUMBER() OVER PARTITION solution already documented in this post is
SELECT t.*
FROM
(
SELECT id
,rev
,contents
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rank
FROM YourTable
) t
WHERE t.rank = 1
These two SELECT statements work well on Oracle 10g.
The MAX() solution should run faster than the ROW_NUMBER() solution, because MAX() is O(n) while ROW_NUMBER() is at least O(n log n), where n is the number of records in the table.
Something like this?
SELECT yourtable.id, rev, content
FROM yourtable
INNER JOIN (
SELECT id, max(rev) as maxrev
FROM yourtable
GROUP BY id
) AS child ON (yourtable.id = child.id) AND (yourtable.rev = maxrev)
I like to use a NOT EXISTS-based solution for this problem:
SELECT
    id,
    rev
    -- you can select other columns here
FROM YourTable t
WHERE NOT EXISTS (
    -- the inner alias must differ from the outer one, otherwise each row is compared with itself
    SELECT * FROM YourTable t2 WHERE t2.id = t.id AND t2.rev > t.rev
)
This will select all records with max value within the group and allows you to select other columns.
SELECT *
FROM Employee
where Employee.Salary in (select max(salary) from Employee group by Employe_id)
ORDER BY Employee.Salary
Note: I probably wouldn't recommend this anymore in MySQL 8+ days. Haven't used it in years.
A third solution I hardly ever see mentioned is MySQL specific and looks like this:
SELECT id, MAX(rev) AS rev
, 0+SUBSTRING_INDEX(GROUP_CONCAT(numeric_content ORDER BY rev DESC), ',', 1) AS numeric_content
FROM t1
GROUP BY id
Yes it looks awful (converting to string and back etc.) but in my experience it's usually faster than the other solutions. Maybe that's just for my use cases, but I have used it on tables with millions of records and many unique ids. Maybe it's because MySQL is pretty bad at optimizing the other solutions (at least in the 5.0 days when I came up with this solution).
One important thing is that GROUP_CONCAT has a maximum length for the string it can build up. You probably want to raise this limit by setting the group_concat_max_len variable. And keep in mind that this will be a limit on scaling if you have a large number of rows.
Anyway, the above doesn't directly work if your content field is already text. In that case you probably want to use a different separator, like \0 maybe. You'll also run into the group_concat_max_len limit quicker.
I think you want this:
select * from docs where (id, rev) IN (select id, max(rev) as rev from docs group by id order by id)
SQL Fiddle :
Check here
NOT mySQL, but for other people finding this question and using SQL, another way to resolve the greatest-n-per-group problem is using Cross Apply in MS SQL
WITH DocIds AS (SELECT DISTINCT id FROM docs)
SELECT d2.id, d2.rev, d2.content
FROM DocIds d1
CROSS APPLY (
SELECT Top 1 * FROM docs d
WHERE d.id = d1.id
ORDER BY rev DESC
) d2
Here's an example in SqlFiddle
I would use this:
select t.*
from test as t
join
(select id, max(rev) as rev
from test
group by id) as o
on o.id = t.id and o.rev = t.rev
A subquery in the SELECT list may not be too efficient, but in a JOIN clause it seems to be usable. I'm not an expert in optimizing queries, but I've tried it on MySQL, PostgreSQL, and Firebird and it works very well.
You can use this pattern in multiple joins and with a WHERE clause. Here is my working example (solving a problem identical to yours with the table "firmy"):
select *
from platnosci as p
join firmy as f
on p.id_rel_firmy = f.id_rel
join (select max(id_obj) as id_obj
from firmy
group by id_rel) as o
on o.id_obj = f.id_obj and p.od > '2014-03-01'
It is run on tables with tens of thousands of records, and it takes less than 0.01 second on a really not too strong machine.
I wouldn't use an IN clause (as mentioned somewhere above). IN is meant for short lists of constants, not as a query filter built on a subquery. A subquery in IN may be performed for every scanned record, which can make the query take a very long time.
Since this is most popular question with regard to this problem, I'll re-post another answer to it here as well:
It looks like there is simpler way to do this (but only in MySQL):
select *
from (select * from mytable order by id, rev desc ) x
group by id
Please credit answer of user Bohemian in this question for providing such a concise and elegant answer to this problem.
Edit: though this solution works for many people it may not be stable in the long run, since MySQL doesn't guarantee that GROUP BY statement will return meaningful values for columns not in GROUP BY list. So use this solution at your own risk!
If you have many fields in select statement and you want latest value for all of those fields through optimized code:
select * from
(select * from table_name
order by id,rev desc) temp
group by id
How about this:
SELECT all_fields.*
FROM (SELECT id, MAX(rev) AS max_rev FROM yourtable GROUP BY id) AS max_recs
LEFT OUTER JOIN yourtable AS all_fields
ON max_recs.id = all_fields.id AND max_recs.max_rev = all_fields.rev
This solution makes only one selection from YourTable, therefore it's faster. It works only for MySQL and SQLite(for SQLite remove DESC) according to test on sqlfiddle.com. Maybe it can be tweaked to work on other languages which I am not familiar with.
SELECT *
FROM ( SELECT *
FROM ( SELECT 1 as id, 1 as rev, 'content1' as content
UNION
SELECT 2, 1, 'content2'
UNION
SELECT 1, 2, 'content3'
UNION
SELECT 1, 3, 'content4'
) as YourTable
ORDER BY id, rev DESC
) as YourTable
GROUP BY id
Here is a nice way of doing that
Use following code :
with temp as (
select count(field1) as summ , field1
from table_name
group by field1 )
select * from temp where summ = (select max(summ) from temp)
I like to do this by ranking the records by some column. In this case, rank rev values grouped by id. Those with higher rev will have lower rankings. So highest rev will have ranking of 1.
select id, rev, content
from
(select
@rowNum := if(@prevValue = id, @rowNum+1, 1) as row_num,
id, rev, content,
@prevValue := id
from
(select id, rev, content from YOURTABLE order by id asc, rev desc) TEMP,
(select @rowNum := 1 from DUAL) X,
(select @prevValue := -1 from DUAL) Y) TEMP
where row_num = 1;
Not sure if introducing variables makes the whole thing slower. But at least I'm not querying YOURTABLE twice.
here is another solution hope it will help someone
Select a.id , a.rev, a.content from Table1 a
inner join
(SELECT id, max(rev) rev FROM Table1 GROUP BY id) x on x.id =a.id and x.rev =a.rev
None of these answers have worked for me.
This is what worked for me.
with score as (select max(score_up) from history)
select history.* from score, history where history.score_up = score.max
Here's another solution to retrieving the records only with a field that has the maximum value for that field. This works for SQL400 which is the platform I work on. In this example, the records with the maximum value in field FIELD5 will be retrieved by the following SQL statement.
SELECT A.KEYFIELD1, A.KEYFIELD2, A.FIELD3, A.FIELD4, A.FIELD5
FROM MYFILE A
WHERE RRN(A) IN
(SELECT RRN(B)
FROM MYFILE B
WHERE B.KEYFIELD1 = A.KEYFIELD1 AND B.KEYFIELD2 = A.KEYFIELD2
ORDER BY B.FIELD5 DESC
FETCH FIRST ROW ONLY)
Sorted the rev field in reverse order and then grouped by id which gave the first row of each grouping which is the one with the highest rev value.
SELECT * FROM (SELECT * FROM table1 ORDER BY id, rev DESC) X GROUP BY X.id;
Tested in http://sqlfiddle.com/ with the following data
CREATE TABLE table1
(`id` int, `rev` int, `content` varchar(11));
INSERT INTO table1
(`id`, `rev`, `content`)
VALUES
(1, 1, 'One-One'),
(1, 2, 'One-Two'),
(2, 1, 'Two-One'),
(2, 2, 'Two-Two'),
(3, 2, 'Three-Two'),
(3, 1, 'Three-One'),
(3, 3, 'Three-Three')
;
This gave the following result in MySql 5.5 and 5.6
id rev content
1 2 One-Two
2 2 Two-Two
3 3 Three-Three
You can make the select without a join when you combine the rev and id into one maxRevId value for MAX() and then split it back to original values:
SELECT maxRevId & ((1 << 32) - 1) as id, maxRevId >> 32 AS rev
FROM (SELECT MAX(((rev << 32) | id)) AS maxRevId
FROM YourTable
GROUP BY id) x;
This is especially fast when there is a complex join instead of a single table. With the traditional approaches the complex join would be done twice.
The above combination is simple with bit functions when rev and id are INT UNSIGNED (32 bit) and the combined value fits into BIGINT UNSIGNED (64 bit). When id and rev are larger than 32-bit values or made of multiple columns, you need to combine the values into e.g. a binary value with suitable padding for MAX().
Explanation
This is not pure SQL. This will use the SQLAlchemy ORM.
I came here looking for SQLAlchemy help, so I will duplicate Adrian Carneiro's answer with the python/SQLAlchemy version, specifically the outer join part.
This query answers the question of:
"Can you return me the records in this group of records (based on same id) that have the highest version number".
This allows me to duplicate the record, update it, increment its version number, and have the copy of the old version in such a way that I can show change over time.
Code
MyTableAlias = aliased(MyTable)
newest_records = appdb.session.query(MyTable).select_from(join(
MyTable,
MyTableAlias,
onclause=and_(
MyTable.id == MyTableAlias.id,
MyTable.version_int < MyTableAlias.version_int
),
isouter=True
)
).filter(
MyTableAlias.id == None,
).all()
Tested on a PostgreSQL database.
I used the below to solve a problem of my own. I first created a temp table and inserted the max rev value per unique id.
CREATE TABLE #temp1
(
id varchar(20)
, rev int
)
INSERT INTO #temp1
SELECT a.id, MAX(a.rev) as rev
FROM
(
SELECT id, content, SUM(rev) as rev
FROM YourTable
GROUP BY id, content
) as a
GROUP BY a.id
ORDER BY a.id
I then joined these max values (#temp1) to all of the possible id/content combinations. By doing this, I naturally filter out the non-maximum id/content combinations, and am left with the only max rev values for each.
SELECT a.id, a.rev, content
FROM #temp1 as a
LEFT JOIN
(
SELECT id, content, SUM(rev) as rev
FROM YourTable
GROUP BY id, content
) as b on a.id = b.id and a.rev = b.rev
GROUP BY a.id, a.rev, b.content
ORDER BY a.id

How to improve performance of this SQL Server query?

I was asked this question at a web developer interview. After my answer, the interviewer said "you're in the second table" :(
I have two tables, employee and bademployee:
employee (empid int pk, name varchar(20))
bademployee (badempid int pk, name varchar(20))
Now, I want to select only good employees.
My answer was :
SELECT *
FROM employee
WHERE empid NOT IN (SELECT badempid from bademployee)
He said this query is not good for performance.
Can anyone tell me how to write a query for the same result without using negative terms (NOT IN, !=)?
Can it be done using LEFT OUTER JOIN ?
This can be rewritten using an OUTER JOIN with a NULL check or by using NOT EXISTS. I prefer NOT EXISTS:
SELECT *
FROM Employee e
WHERE NOT EXISTS (
SELECT 1
FROM bademployee b
WHERE e.empid = b.badempid)
Here is the OUTER JOIN, but I believe you'll have better performance with NOT EXISTS.
SELECT e.*
FROM Employee e
LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL
Here's an interesting article about the performance differences: http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join
Whatever someone else may say, you need to check the execution plan and base your conclusion on what that says. Never just trust someone who claims this or that; look into the claims and verify them against documentation on the subject and, in this case, the execution plan, which clearly tells you what is going on.
One example from SQL Authority blogs shows that the LEFT JOIN solution performs much worse than the NOT IN solution. This is because the query planner executes NOT IN as a LEFT ANTI SEMI JOIN, which generally performs a lot better than a LEFT JOIN + NULL check. There may be exceptions when there are very few rows. The author also tells you afterwards the same as I did in the first paragraph: always check the execution plan.
Another blog post from SQL Performance blogs goes into this further with actual performance testing results.
TL;DR: In terms of performance, NOT EXISTS and NOT IN are on the same level, but NOT EXISTS is preferred due to issues with NULL values. Also, don't just trust what anyone claims; research and verify your execution plan.
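The NULL issue is easy to reproduce. The sketch below uses Python with SQLite rather than SQL Server, and it deliberately drops the primary key on badempid (unlike the schema in the question) so that a NULL can get in; the names and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empid INTEGER PRIMARY KEY, name TEXT);
    -- assumption for the demo: badempid is nullable here, unlike in the question
    CREATE TABLE bademployee (badempid INTEGER, name TEXT);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO bademployee VALUES (2, 'Bob');
""")

QUERIES = {
    "not_in": """SELECT empid FROM employee
                 WHERE empid NOT IN (SELECT badempid FROM bademployee)
                 ORDER BY empid""",
    "not_exists": """SELECT e.empid FROM employee e
                     WHERE NOT EXISTS (SELECT 1 FROM bademployee b
                                       WHERE b.badempid = e.empid)
                     ORDER BY e.empid""",
    "left_join": """SELECT e.empid FROM employee e
                    LEFT JOIN bademployee b ON e.empid = b.badempid
                    WHERE b.badempid IS NULL
                    ORDER BY e.empid""",
}

def good_ids(conn):
    """Run all three formulations and return {name: [empid, ...]}."""
    return {name: [row[0] for row in conn.execute(sql)] for name, sql in QUERIES.items()}

before = good_ids(conn)
# One NULL in the subquery makes "empid NOT IN (...)" evaluate to UNKNOWN for every row
conn.execute("INSERT INTO bademployee VALUES (NULL, 'unknown')")
after = good_ids(conn)
```

Before the NULL row is added all three queries agree; afterwards NOT IN returns nothing, while NOT EXISTS and the LEFT JOIN still return employees 1 and 3.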
I think the interviewer was wrong about the performance difference. Because the joined column is unique and not null in both tables, the NOT IN, NOT EXISTS, and LEFT JOIN...WHERE IS NULL queries are semantically identical. SQL is a declarative language, so the SQL Server optimizer may provide optimal and identical plans regardless of how the query is expressed. That said, it is not always perfect, so there may be variances, especially with more complex queries.
Below is a script that demonstrates this. On my SQL Server 2014 box, I see identical execution plans for the first 2 queries (ordered clustered index scans and a merge join), and the addition of a filter operator in the last. I would expect identical performance with all 3 so it doesn't really matter from a performance perspective. I would generally use NOT EXISTS because the intent is clearer and it avoids the gotcha in the case a NULL is returned by the NOT IN subquery, thus resulting in zero rows returned due to the UNKNOWN predicate result.
I would not generalize performance comparisons like this. If the joined columns allow NULL or are not guaranteed to be unique, these queries are not semantically the same and may yield different execution plans as a result.
CREATE TABLE dbo.employee (
empid int CONSTRAINT pk_employee PRIMARY KEY
, name varchar(20)
);
CREATE TABLE dbo.bademployee (
badempid int CONSTRAINT pk_bademployee PRIMARY KEY
, name varchar(20)
);
WITH
t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c)
INSERT INTO dbo.employee(empid, name)
SELECT num, 'Employee name ' + CAST(num AS varchar(10))
FROM t16M
WHERE num <= 10000;
INSERT INTO dbo.bademployee(badempid, name)
SELECT TOP 5 PERCENT empid, name
FROM dbo.employee
ORDER BY NEWID();
GO
UPDATE STATISTICS dbo.employee WITH FULLSCAN;
UPDATE STATISTICS dbo.bademployee WITH FULLSCAN;
GO
SELECT *
FROM employee
WHERE empid NOT IN (SELECT badempid from bademployee);
SELECT *
FROM Employee e
WHERE NOT EXISTS (
SELECT 1
FROM bademployee b
WHERE e.empid = b.badempid);
SELECT e.*
FROM Employee e
LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL;
GO

Tree structure data query in SQL Server

I have a table Person that has 3 columns: Id, Name, ParentId where ParentId is the Id of the parent row.
Currently, to display the entire tree, it would have to loop through all child elements until there's no more child elements. It doesn't seem too efficient.
Is there a better and more efficient way to query this data?
Also, is there a better way to represent this tree like structure in a SQL Server database? An alternative design for my table/database?
I don't think there's anything wrong with the design, assuming you have a limited level of parent-child relationships. Here is a quick example of retrieving the relationship using a recursive CTE:
USE tempdb;
GO
CREATE TABLE dbo.tree
(
ID INT PRIMARY KEY,
name VARCHAR(32),
ParentID INT FOREIGN KEY REFERENCES dbo.tree(ID)
);
INSERT dbo.tree SELECT 1, 'grandpa', NULL
UNION ALL SELECT 2, 'dad', 1
UNION ALL SELECT 3, 'me', 2
UNION ALL SELECT 4, 'mom', 1
UNION ALL SELECT 5, 'grandma', NULL;
;WITH x AS
(
-- anchor:
SELECT ID, name, ParentID, [level] = 0
FROM dbo.tree WHERE ParentID IS NULL
UNION ALL
-- recursive:
SELECT t.ID, t.name, t.ParentID, [level] = x.[level] + 1
FROM x INNER JOIN dbo.tree AS t
ON t.ParentID = x.ID
)
SELECT ID, name, ParentID, [level] FROM x
ORDER BY [level]
OPTION (MAXRECURSION 32);
GO
Don't forget to clean up:
DROP TABLE dbo.tree;
This might be a useful article. An alternative is hierarchyid but I find it overly complex for most scenarios.
Aaron Bertrand's answer is very good for the general case. If you only ever need to display the whole tree at once, you can just query the whole table and perform the tree-building in memory. This is likely to be more convenient and flexible. Performance also will be slightly better (the whole table needs to be downloaded anyway and C# is faster for such calculations than SQL Server).
If you only need a part of the tree this method is not recommended because you'd be downloading more data than needed.
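As a rough idea of what the in-memory approach looks like, here is a sketch in Python (the answer has C# in mind, the idea is the same). It assumes the rows arrive as (Id, Name, ParentId) tuples from a plain SELECT of the whole table; `build_tree`, `render`, and the sample rows are illustrative names only.

```python
def build_tree(rows):
    """Turn a list of (Id, Name, ParentId) tuples into nested dicts; returns the root nodes."""
    # First pass: one node per row, indexed by Id
    nodes = {row_id: {"id": row_id, "name": name, "children": []}
             for row_id, name, _ in rows}
    roots = []
    # Second pass: attach each node to its parent (or to the root list)
    for row_id, _, parent_id in rows:
        if parent_id is None:
            roots.append(nodes[row_id])
        else:
            nodes[parent_id]["children"].append(nodes[row_id])
    return roots

def render(nodes, depth=0):
    """Flatten the tree into indented lines, depth-first."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["name"])
        lines.extend(render(node["children"], depth + 1))
    return lines

# Same sample data as the CTE example above, as it would come back from
# SELECT ID, name, ParentID FROM dbo.tree
rows = [(1, "grandpa", None), (2, "dad", 1), (3, "me", 2), (4, "mom", 1), (5, "grandma", None)]
tree_lines = render(build_tree(rows))
```

Both passes are linear in the number of rows, and the row order does not matter because parents are looked up by Id.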

T-SQL filtering on dynamic name-value pairs

I'll describe what I am trying to achieve:
I am passing down to a SP an xml with name value pairs that I put into a table variable, let's say @nameValuePairs.
I need to retrieve a list of IDs for expressions (a table) with those exact match of name-value pairs (attributes, another table) associated.
This is my schema:
Expressions table --> (expressionId, attributeId)
Attributes table --> (attributeId, attributeName, attributeValue)
After trying complicated stuff with dynamic SQL and evil cursors (which works but it's painfully slow) this is what I've got now:
--do the magic plz!
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
select distinct
e.expressionId, a.attributeName, a.attributeValue
into
#temp
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
group by
e.expressionId, a.attributeName, a.attributeValue
-- now select the IDs I need
-- since I did a select distinct above if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select distinct
expressionId
from
#temp
group by expressionId
having count(*) = @noOfAttributes
Can people please review and see if they can spot any problems? Is there a better way of doing this?
Any help appreciated!
I believe that this would satisfy the requirement you're trying to meet. I'm not sure how much prettier it is, but it should work and wouldn't require a temp table:
SET @noOfAttributes = (select count(*) from @nameValuePairs)
SELECT e.expressionid
FROM expressions e
LEFT JOIN (
    SELECT a.attributeid
    FROM attributes a
    JOIN @nameValuePairs nvp ON nvp.name = a.attributeName AND nvp.value = a.attributeValue
) t ON t.attributeid = e.attributeid
GROUP BY e.expressionid
HAVING SUM(CASE WHEN t.attributeid IS NULL THEN (@noOfAttributes + 1) ELSE 1 END) = @noOfAttributes
EDIT: After doing some more evaluation, I found an issue where certain expressions would be included that shouldn't have been. I've modified my query to take that in to account.
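To see how the LEFT JOIN and SUM test behaves on concrete rows, here is a small runnable sketch. It is not SQL Server: Python's built-in sqlite3 module stands in for it, an ordinary table named nameValuePairs stands in for the table variable, and the sample rows are invented. The query itself follows the shape of the one above, using the column names from the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (attributeId INTEGER PRIMARY KEY,
                             attributeName TEXT, attributeValue TEXT);
    CREATE TABLE expressions (expressionId INTEGER, attributeId INTEGER);
    CREATE TABLE nameValuePairs (name TEXT, value TEXT);

    INSERT INTO attributes VALUES (1, 'color', 'red'), (2, 'size', 'L'),
                                  (3, 'shape', 'round');
    -- 10 has color+size, 20 has color only, 30 has color+size+shape
    INSERT INTO expressions VALUES (10, 1), (10, 2), (20, 1),
                                   (30, 1), (30, 2), (30, 3);
    INSERT INTO nameValuePairs VALUES ('color', 'red'), ('size', 'L');
""")

pair_count = conn.execute("SELECT COUNT(*) FROM nameValuePairs").fetchone()[0]

# An attribute outside the requested set contributes (n + 1) to the sum,
# so a single extra attribute pushes the total past n and rejects the row.
query = """
    SELECT e.expressionId
    FROM expressions e
    LEFT JOIN (
        SELECT a.attributeId
        FROM attributes a
        JOIN nameValuePairs nvp
          ON nvp.name = a.attributeName AND nvp.value = a.attributeValue
    ) t ON t.attributeId = e.attributeId
    GROUP BY e.expressionId
    HAVING SUM(CASE WHEN t.attributeId IS NULL THEN :n + 1 ELSE 1 END) = :n
    ORDER BY e.expressionId
"""
exact_ids = [row[0] for row in conn.execute(query, {"n": pair_count})]
print(exact_ids)  # [10]
```

Expression 20 is rejected for having too few matches and expression 30 for carrying an extra attribute, so only 10 is an exact match.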
One error I see is that you have no table with an alias of b, yet you are using: a.attributeId = b.attributeId.
Try fixing that and see if it works, unless I am missing something.
EDIT: I think you just fixed this in your edit, but is it supposed to be a.attributeId = e.attributeId?
This is not a bad approach, depending on the sizes and indexes of the tables, including @nameValuePairs. If the row counts are high or the query otherwise becomes slow, you may do better to put @nameValuePairs into a temp table instead, add appropriate indexes, and use a single query instead of two separate ones.
I do notice that you are putting columns into #temp that you are not using; it would be faster to exclude them (though it would mean duplicate rows in #temp). Also, your second query has both a "distinct" and a "group by" on the same columns. You don't need both, so I would drop the "distinct" (this probably won't affect performance, because the optimizer has likely already figured this out).
Finally, #temp would probably be faster with a clustered non-unique index on expressionId (I am assuming that this is SQL 2005). You could add it after the SELECT..INTO, but it is usually as fast or faster to add it before you load. This would require you to CREATE #temp first, add the clustered index, and then use INSERT..SELECT to load it instead.
I'll add an example of merging the queries in a minute... OK, here's one way to merge them into a single query (this should be 2000-compatible also):
-- retrieve number of name-value pairs
SET @noOfAttributes = (select count(*) from @nameValuePairs)
-- now select the IDs I need
-- since the derived table below uses select distinct, if the number of matches
-- for a given ID is the same as noOfAttributes then BINGO!
select
expressionId
from
(
select distinct
e.expressionId, a.attributeName, a.attributeValue
from
expressions e
join
attributes a
on
e.attributeId = a.attributeId
join --> this join does the filtering
@nameValuePairs nvp
on
a.attributeName = nvp.name and a.attributeValue = nvp.value
) as Temp
group by expressionId
having count(*) = @noOfAttributes
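The merged query can be tried out on a few rows with the sketch below. It makes the same kind of assumptions as any quick test harness: Python's built-in sqlite3 module stands in for SQL Server, a plain table called nameValuePairs replaces the table variable, the count is inlined as a subquery instead of a variable, and the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (attributeId INTEGER PRIMARY KEY,
                             attributeName TEXT, attributeValue TEXT);
    CREATE TABLE expressions (expressionId INTEGER, attributeId INTEGER);
    CREATE TABLE nameValuePairs (name TEXT, value TEXT);

    INSERT INTO attributes VALUES (1, 'color', 'red'), (2, 'size', 'L'),
                                  (3, 'shape', 'round');
    -- 10 has color+size, 20 has color only, 30 has color+size+shape
    INSERT INTO expressions VALUES (10, 1), (10, 2), (20, 1),
                                   (30, 1), (30, 2), (30, 3);
    INSERT INTO nameValuePairs VALUES ('color', 'red'), ('size', 'L');
""")

# Count the distinct matched pairs per expression and keep the expressions
# whose count equals the number of requested pairs.
query = """
    SELECT expressionId
    FROM (
        SELECT DISTINCT e.expressionId, a.attributeName, a.attributeValue
        FROM expressions e
        JOIN attributes a ON e.attributeId = a.attributeId
        JOIN nameValuePairs nvp
          ON a.attributeName = nvp.name AND a.attributeValue = nvp.value
    ) AS matched
    GROUP BY expressionId
    HAVING COUNT(*) = (SELECT COUNT(*) FROM nameValuePairs)
    ORDER BY expressionId
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)  # [10, 30]
```

Note that expression 30 is returned even though it carries an extra attribute: the inner joins discard non-matching attributes before counting. If "exact match" must exclude such expressions, the LEFT JOIN variant with the SUM test is the one that rejects them.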

Is there a way to optimize the query given below

I have the following query, and I need it to fetch data from SomeTable based on the filter criteria present in SomeOtherTable. If there is nothing present in SomeOtherTable, the query should return all the data present in SomeTable.
SQL SERVER 2005
SomeOtherTable does not have any indexes or any constraint all fields are char(50)
The following query works fine for my requirements, but it causes performance problems when I have lots of parameters.
Due to a client requirement, we have to keep all the WHERE clause data in SomeOtherTable. Depending on subid, the data will be joined with one of the columns in SomeTable.
For example, the query can be:
SELECT
*
FROM
SomeTable
WHERE
1=1
AND
(
SomeTable.ID in (SELECT DISTINCT ID FROM SomeOtherTable WHERE Name = 'ABC' and subid = 'EF')
OR
0=(SELECT Count(1) FROM SomeOtherTable WHERE spName = 'ABC' and subid = 'EF')
)
AND
(
SomeTable.date =(SELECT date FROM SomeOtherTable WHERE Name = 'ABC' and subid = 'Date')
OR
0=(SELECT Count(1) FROM SomeOtherTable WHERE spName = 'ABC' and subid = 'Date')
)
EDIT:
I think I might have to explain my problem in detail:
We have developed an ASP.NET application that is used to invoke parameterized Crystal Reports. Parameters are not passed to the reports using the default Crystal Reports method.
In the ASP.NET application we have created wizards which are used to pass the parameters to the reports. These parameters are not consumed directly by the Crystal Report but by the query embedded inside the report or by the stored procedure the report uses.
This is achieved using a table (SomeOtherTable) which holds the parameter data for as long as the report is running, after which the data is deleted. As such, we can assume that SomeOtherTable has at most 2 to 3 rows at any given point in time.
So in the query above, the initial part can be regarded as the report query, and the WHERE clause is used to get the user input from SomeOtherTable.
So I don't think it will be useful to create indexes etc. (maybe I am wrong).
"SomeOtherTable does not have any indexes or any constraint all fields are char(50)"
Well, there's your problem. There's nothing you can do to a query like this which will improve its performance if you create it like this.
You need a proper primary or other candidate key designated on all of your tables. That is to say, you need at least ONE unique index on the table. You can do this by designating one or more fields as the PK, or you can add a UNIQUE constraint or index.
You need to define your fields properly. Does the field store integers? Well then, an INT field may just be a better bet than a CHAR(50).
You can't "optimize" a query that is based on an unsound schema.
Try:
SELECT
*
FROM
SomeTable
LEFT JOIN SomeOtherTable ON SomeTable.ID=SomeOtherTable.ID AND Name = 'ABC'
WHERE
1=1
AND
(
SomeOtherTable.ID IS NOT NULL
OR
0=(SELECT Count(1) FROM SomeOtherTable WHERE spName = 'ABC')
)
You can also put 'with (nolock)' after each table name, which may reduce blocking, but be aware that it allows dirty reads.
The following might speed you up
SELECT *
FROM SomeTable
WHERE
SomeTable.ID in
(SELECT DISTINCT ID FROM SomeOtherTable Where Name = 'ABC')
UNION
SELECT *
FROM SomeTable
Where
NOT EXISTS (Select spName From SomeOtherTable Where spName = 'ABC')
The UNION will effectively split this into two simpler queries which can be optimised separately (whether this actually improves performance depends very much on the DBMS, table size etc., but it's always worth a try).
The EXISTS keyword is generally more efficient than SELECT COUNT(1) because it can return true as soon as the first row is encountered.
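The "filter if parameter rows exist, otherwise return everything" behaviour can be checked with a small sketch like the one below. It runs on Python's built-in sqlite3 module instead of SQL Server, the rows are invented, and it assumes a single column called spName (the thread uses both Name and spName, so that choice is a guess).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (ID TEXT, payload TEXT);
    CREATE TABLE SomeOtherTable (ID TEXT, spName TEXT, subid TEXT);
    INSERT INTO SomeTable VALUES ('1', 'a'), ('2', 'b'), ('3', 'c');
""")

# EXISTS / NOT EXISTS replace "ID IN (...)" and "0 = (SELECT COUNT(1) ...)".
QUERY = """
    SELECT t.ID
    FROM SomeTable t
    WHERE EXISTS (SELECT 1 FROM SomeOtherTable o
                  WHERE o.spName = :sp AND o.ID = t.ID)
       OR NOT EXISTS (SELECT 1 FROM SomeOtherTable o
                      WHERE o.spName = :sp)
    ORDER BY t.ID
"""


def fetch_ids(sp_name):
    """Run the filter query for one report name and return the matching IDs."""
    return [row[0] for row in conn.execute(QUERY, {"sp": sp_name})]


no_filter = fetch_ids("ABC")  # no parameter rows yet, so every row comes back
conn.execute("INSERT INTO SomeOtherTable VALUES ('2', 'ABC', 'EF')")
with_filter = fetch_ids("ABC")  # one parameter row, so only ID '2' matches
print(no_filter, with_filter)
```

Whether this form is actually faster on SQL Server 2005 than the COUNT(1) version depends on the plan the optimizer picks, so it is worth comparing the two on real data.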
Or check first whether the value exists in the database.
You can also remove the DISTINCT keyword from your query; it is useless here.
if EXISTS (Select spName From SomeOtherTable Where spName = 'ABC')
begin
SELECT *
FROM SomeTable
WHERE
SomeTable.ID in
(SELECT ID FROM SomeOtherTable Where Name = 'ABC')
end
else
begin
SELECT *
FROM SomeTable
end
Aloha
Try
select t.* from SomeTable t
left outer join SomeOtherTable o
on t.id = o.id
where (not exists (select id from SomeOtherTable where spname = 'adbc')
OR spname = 'adbc')
-Edoode
Change all the select statements in your WHERE clause to inner joins.
The OR conditions should be UNION ALL-ed.
Also make sure your indexing is OK.
Sometimes it pays to have an intermediate table for temp results that you can join to.
It seems to me that there is no need for the "1=1 AND" in your query. 1=1 will always evaluate to be true, leaving the software to evaluate the next part... why not just skip the 1=1 and evaluate the juicy part?
I am going to stick to my original Query.

Resources