Pivot Table vs Parent_ID for 1-Many Relation

In a general one-to-many (parent-to-child) relationship, is there a significant efficiency difference between (a) putting parent_id in the child table and (b) using a pivot table of only parent_id, child_id?
NOTE: Assume Oracle if necessary, otherwise answer using a general RDBMS.

If by pivot table you mean a many-to-many link table, then no, don't use one: it will only hamper performance.
You should keep parent_id in the child table.
The many-to-many link table takes an extra JOIN and therefore is less efficient.
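For reference, option (a) is just a foreign-key column on the child table; a minimal sketch (table and column names are illustrative and match the queries below):
CREATE TABLE parent_table (
  id INT PRIMARY KEY
);
CREATE TABLE child_table (
  id INT PRIMARY KEY,
  parent_id INT REFERENCES parent_table (id),
  property VARCHAR(100)
);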
Compare the following queries:
SELECT *
FROM child_table c
JOIN child_to_parent cp
  ON cp.child = c.id
JOIN parent_table p
  ON p.id = cp.parent
WHERE c.property = 'some_property'
and this one:
SELECT *
FROM child_table c
JOIN parent_table p
  ON p.id = c.parent_id
WHERE c.property = 'some_property'
The latter one is one JOIN shorter and more efficient.
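If you want to verify this yourself, compare the optimizer's plans for the two queries; in Oracle, a sketch of that check looks like:
EXPLAIN PLAN FOR
SELECT *
FROM child_table c
JOIN parent_table p
  ON p.id = c.parent_id
WHERE c.property = 'some_property';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- repeat with the three-table version and compare the number of join steps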
The only possible exception to that rule is when you often run queries like this:
SELECT *
FROM child_table c
JOIN parent_table p
  ON p.id = c.parent_id
WHERE c.id IN (id1, id2, ...)
That is, you know the ids of the child rows beforehand.
This may be useful if you use natural keys for your child_table.
In this case, yes, the child_to_parent link table will be more efficient, since you can replace the query with this one:
SELECT *
FROM child_to_parent cp
JOIN parent_table p
  ON p.id = cp.parent
WHERE cp.child IN (id1, id2, ...)
and child_to_parent will always be less than or equal in size to child_table, and hence more efficient.
However, in Oracle you can achieve just the same result by creating a composite index on child_table (id, parent_id).
Since Oracle does not index rows whose indexed columns are all NULL, this index will behave just like your child_to_parent table, but without the table itself and its implied maintenance overhead.
In other systems (which do index NULLs), the index may be less efficient than a dedicated table, especially if you have lots of NULL parents.
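A sketch of that index (the name is hypothetical):
CREATE INDEX idx_child_id_parent ON child_table (id, parent_id);
With it in place, the IN-list lookup can be answered by an index range scan on child_table alone, assuming the optimizer chooses that access path.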

Is this join overcomplicated?

I have inherited an application made by a previous developer. Some of the database calls are running slowly in places where there is a large amount of data. I have found that, in general, the SQL code is well written, but there are places that make me think, 'what the..?'
Here is one example:
select a.*
from bs_ResearchEnquiry a
left join bs_StateWorkflowState_Map b
  on (
    select c.MapId
    from bs_StateWorkflowState_Map c
    where c.StateId = a.StateId
      and c.StateWorkflowId = a.StateWorkflowId
  ) = b.MapId
where b.IsFinal = 1
The MapId field is a unique primary key to the bs_StateWorkflowState_Map table.
StateId and StateWorkflowId together also form a unique key.
There will always be a match on these keys from rows in bs_ResearchEnquiry to rows in bs_StateWorkflowState_Map.
Therefore, could I rewrite the left join more efficiently, and safely, as:
inner join bs_StateWorkflowState_Map b
on b.StateId = a.StateId AND b.StateWorkflowId = a.StateWorkflowId
Or was the original developer trying to achieve something I've missed ?
Your simplification looks good to me. Note that the presence of:
where b.IsFinal = 1
means that the outer join is effectively an inner join.
Given your explanation of the keys, you are right: the query can be simplified. It selects records from bs_ResearchEnquiry whose associated bs_StateWorkflowState_Map record is final. So use EXISTS:
select *
from bs_ResearchEnquiry re
where exists (
  select *
  from bs_StateWorkflowState_Map m
  where m.StateId = re.StateId
    and m.StateWorkflowId = re.StateWorkflowId
    and m.IsFinal = 1
);
(From your explanation of uniqueness, I gather that indexes on (StateId, StateWorkflowId) already exist in both tables. If not, create them.)
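For instance, a minimal sketch (index names are hypothetical, and the unique key on (StateId, StateWorkflowId) may already be backed by an index):
CREATE UNIQUE INDEX ix_map_state_workflow ON bs_StateWorkflowState_Map (StateId, StateWorkflowId);
CREATE INDEX ix_re_state_workflow ON bs_ResearchEnquiry (StateId, StateWorkflowId);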

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query on. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B), containing the record id from Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which of the following should I choose?
Option 1:
SELECT *
FROM TableA
WHERE record_id IN
  (SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
Option 2:
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
Option 3:
SELECT a.*
FROM TableA a
INNER JOIN TableB b
  ON a.record_id = b.record_id
  AND b.col1 IS NOT NULL
  AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I haven't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS (
  SELECT b.record_id, b.col1, b.col2
  FROM TableB b
  WHERE col1 IS NOT NULL
    AND col2 IS NOT NULL -- if the columns are never NULL, it might be quicker to test <> ''
)
SELECT a.record_id, a.identifyColumnsNeededExplicitly
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows, general content of your rows, ...). Any analysis should be done case by case, and what's true for two tables may not be true for two others.
Theoretically:
Without any index, the first one should be the best, since the optimizer will turn it into one table scan on TableB, two constant scans on TableB, and one table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, or an index on both columns, the second should be faster, since it will do an index scan and two constant scans.
In rare cases it could be the third one, depending on TableB's statistics, but it won't be far from number 2, since number 3 will scan all of TableB.
In even rarer cases, none of the three.
What I'm trying to say is: "Since we have neither your tables nor your rows, open SQL Server Management Studio, turn the stats on, and try it yourself."
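For example, a minimal sketch of that kind of measurement, using the standard SET STATISTICS switches; run each candidate in turn and compare logical reads and CPU time:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run each candidate query here, e.g. option 2:
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;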

How to improve performance of this SQL Server query?

I was asked this question at a web developer interview. After my answer, the interviewer said, 'You're in the second table' :(
I have two tables employee and bademployee:
employee (empid int pk, name varchar(20))
bademployee (badempid int pk, name varchar(20))
Now, I want to select only good employees.
My answer was :
SELECT *
FROM employee
WHERE empid NOT IN (SELECT badempid from bademployee)
He said this query is not good for performance.
Can anyone tell me how to write a query for the same result without using negative terms (NOT IN, !=)?
Can it be done using LEFT OUTER JOIN ?
This can be rewritten using an OUTER JOIN with a NULL check or by using NOT EXISTS. I prefer NOT EXISTS:
SELECT *
FROM employee e
WHERE NOT EXISTS (
  SELECT 1
  FROM bademployee b
  WHERE e.empid = b.badempid
)
Here is the OUTER JOIN version, but I believe you'll have better performance with NOT EXISTS.
SELECT e.*
FROM Employee e
LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL
Here's an interesting article about the performance differences: http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join
Whatever someone else may say, you need to check the execution plan and base your conclusion on what it says. Never just trust someone who claims this or that; research their claims and verify them against documentation on the subject and, in this case, the execution plan, which clearly tells you what is going on.
One example from the SQL Authority blog shows the LEFT JOIN solution performing much worse than the NOT IN solution. This is due to a LEFT ANTI SEMI JOIN done by the query planner, which generally performs a lot better than a LEFT JOIN + NULL check. There may be exceptions when there are very few rows. The author also concludes the same as I did in the first paragraph: always check the execution plan.
Another post, on the SQL Performance blog, goes into this further with actual performance testing results.
TL;DR: In terms of performance, NOT EXISTS and NOT IN are on the same level, but NOT EXISTS is preferred due to issues with NULL values. Also, don't just trust what anyone claims; research and verify with your execution plan.
I think the interviewer was wrong about the performance difference. Because the joined column is unique and not null in both tables, the NOT IN, NOT EXISTS, and LEFT JOIN...WHERE IS NULL queries are semantically identical. SQL is a declarative language, so the SQL Server optimizer may provide optimal and identical plans regardless of how the query is expressed. That said, it is not always perfect, so there may be variances, especially with more complex queries.
Below is a script that demonstrates this. On my SQL Server 2014 box, I see identical execution plans for the first 2 queries (ordered clustered index scans and a merge join), and the addition of a filter operator in the last. I would expect identical performance with all 3 so it doesn't really matter from a performance perspective. I would generally use NOT EXISTS because the intent is clearer and it avoids the gotcha in the case a NULL is returned by the NOT IN subquery, thus resulting in zero rows returned due to the UNKNOWN predicate result.
I would not generalize performance comparisons like this. If the joined columns allow NULL or are not guaranteed to be unique, these queries are not semantically the same and may yield different execution plans as a result.
CREATE TABLE dbo.employee (
empid int CONSTRAINT pk_employee PRIMARY KEY
, name varchar(20)
);
CREATE TABLE dbo.bademployee (
badempid int CONSTRAINT pk_bademployee PRIMARY KEY
, name varchar(20)
);
WITH
t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c)
INSERT INTO dbo.employee(empid, name)
SELECT num, 'Employee name ' + CAST(num AS varchar(10))
FROM t16M
WHERE num <= 10000;
INSERT INTO dbo.bademployee(badempid, name)
SELECT TOP 5 PERCENT empid, name
FROM dbo.employee
ORDER BY NEWID();
GO
UPDATE STATISTICS dbo.employee WITH FULLSCAN;
UPDATE STATISTICS dbo.bademployee WITH FULLSCAN;
GO
SELECT *
FROM employee
WHERE empid NOT IN (SELECT badempid from bademployee);
SELECT *
FROM Employee e
WHERE NOT EXISTS (
SELECT 1
FROM bademployee b
WHERE e.empid = b.badempid);
SELECT e.*
FROM Employee e
LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL;
GO
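As an aside, a minimal sketch of the NOT IN gotcha mentioned above: if the subquery returns even a single NULL, the predicate evaluates to UNKNOWN for every row and the query returns nothing.
SELECT *
FROM employee
WHERE empid NOT IN (SELECT badempid FROM bademployee
                    UNION ALL
                    SELECT NULL);
-- returns zero rows: empid NOT IN (..., NULL) is never TRUE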

How to avoid table scan and index scan for huge tables

I am using SQL Server 2008 R2. I have a table with a huge number of rows (a test table).
I have the following SQL code; please suggest where I can use index hints, force seek, or any other means to improve performance.
Indexes:
1. non-clustered - idx_id (id)
2. non-clustered - idx_name (name)
SELECT DISTINCT
  p.id,
  p.name
FROM test p
LEFT OUTER JOIN (
  SELECT e.id
  FROM test e
  INNER JOIN (
    SELECT c.id
    FROM test c
    GROUP BY c.id
    HAVING COUNT(1) > 1
  ) f ON e.id = f.id
  WHERE e.name = 'test_name'
) m ON p.id = m.id
WHERE m.id IS NULL
Prerequisite: have a primary key.
SELECT DISTINCT
  p.id,
  p.name
FROM test p
WHERE NOT EXISTS (
  SELECT TOP(1) 1
  FROM test e
  WHERE e.PrimaryKey <> p.PrimaryKey
    AND e.id = p.id
    AND 'test_name' IN (e.name, p.name)
)
How many columns does your table contain? If there are only these two columns, it makes no sense to add a nonclustered index. You should create a CLUSTERED index on the ID column, and that's it - you'll see a performance increase.
If you have many columns, consider two options:
1. Create a clustered index on the NAME column and a nonclustered index on the ID column.
2. Create a nonclustered index on the ID column, and INCLUDE the NAME column (you'll create a covering index that way); see the sketch below.
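A minimal sketch of the second option (the index name is hypothetical):
CREATE NONCLUSTERED INDEX idx_id_include_name
ON test (id)
INCLUDE (name);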
Generally speaking, relational databases (being relational) are written in such a way as to optimize join statements. When using a JOIN clause with ON criteria, the database engine can create an optimized execution plan that takes the table structure, indexes, etc. into account. When joining on a sub-select, the same optimizing factors are sometimes not available, or are not taken into account the same way. It depends on your schema, but it is a good rule of thumb to assume that a standard join with an ON clause will be more efficient than a join on a sub-select.
Your schema is pretty vague, so I am not even sure that you need the joins, but if you do, you should try performing the joins directly with ON criteria.

How can I "subtract" one table from another?

I have a master table A with ~9 million rows. Another table B (same structure) holds ~28K rows taken from table A. What would be the best way to remove all contents of B from table A?
The combination of all columns (~10) is unique. There is nothing more in the form of a unique key.
If you have sufficient rights, you can create a new table and rename it to A. To create the new table you can use the following script:
CREATE TABLE TEMP_A AS
SELECT *
FROM A
MINUS
SELECT *
FROM B
This should perform pretty well.
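The rename step would then be something like this (Oracle syntax; note that indexes, constraints, and grants on A do not carry over to the new table, so recreate them afterwards):
DROP TABLE A;
RENAME TEMP_A TO A;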
DELETE FROM TableA WHERE ID IN (SELECT ID FROM TableB)
Should work. Might take a while though.
One way: just list out all the columns.
DELETE FROM tableA
WHERE EXISTS (
  SELECT 1
  FROM tableB b
  WHERE b.Col1 = tableA.Col1
    AND b.Col2 = tableA.Col2
    AND b.Col3 = tableA.Col3
    AND b.Col4 = tableA.Col4
)
DELETE t2
FROM t1
INNER JOIN t2
  ON t1.col1 = t2.col1
  AND t1.col2 = t2.col2
  AND t1.col3 = t2.col3
  AND t1.col4 = t2.col4
  AND t1.col5 = t2.col5
  AND t1.col6 = t2.col6
  AND t1.col7 = t2.col7
  AND t1.col8 = t2.col8
  AND t1.col9 = t2.col9
  AND t1.col10 = t2.col10
This is likely to be very slow, as you would have to have every column indexed, which is highly unlikely in an environment where a table this size has no primary key, so do it during off-peak hours. What possessed you to have a table with 9 million records and no primary key?
If this is something you'll have to do on a regular basis, the first choice should be to try to improve the database design (looking for primary keys, trying to get the "join" condition to be on as few columns as possible).
If that is not possible, a second option is to figure out the "selectivity" of each of the columns (i.e. how many different values each column has; 'name' would be more selective than 'address country', which would be more selective than 'male/female').
The general type of statement I'd suggest would be like this:
DELETE FROM tableA
WHERE EXISTS (
  SELECT *
  FROM tableB
  WHERE tableA.colx1 = tableB.colx1
    AND tableA.colx2 = tableB.colx2
    -- etc., down to:
    AND tableA.colx10 = tableB.colx10
)
The idea is to list the columns in order of selectivity and build an index on colx1, colx2, etc. on tableB. The exact number of columns to index would be a result of some trial and measurement. (Offset the time for building the index on tableB against the improved time of the delete statement.)
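For instance, assuming colx1 through colx3 turn out to be the most selective (the index name is hypothetical):
CREATE INDEX ix_tableB_selective ON tableB (colx1, colx2, colx3);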
If this is just a one-time operation, I'd just pick one of the slow methods outlined above. It's probably not worth the effort to think too much about this when you can just start a statement before going home ...
Is there a key value (or values) that can be used?
Something like:
DELETE a
FROM tableA a
INNER JOIN tableB b
on b.id = a.id
