Searching for the unique key (not in meta) - sql-server

Which standard SQL Server tool can search for a unique key in a table's data (as opposed to one declared in the metadata)?
P.S. I am thinking of writing such a script myself. Could you point me to a snippet for combinatorics in T-SQL, e.g. for generating all combinations of 1..n items out of n?
P.P.S. About the problem's complexity, for those who do not see it: we do not need to analyze the whole data set to dismiss the hypothesis that a given pair of columns is the 'unique key'. With real-world, 'report-like', sorted data, I think many column combinations can be ruled out after analysing just the first two rows. So I feel such an algorithm should have a phase that runs before the full-table comparison. The question is what portion of the data to use for that phase. The best candidate I can think of is the 'page': if the data is unique within the page, we test uniqueness on the whole table; if it is not unique on the page, we move on to the next column set.

select t1.col, count(*)
from table t1
join table t2
on t1.col = t2.col
group by t1.col
having count(*) > 1
If zero rows are returned, the column is unique.
For more than one column:
select t1.cola, t1.colb, count(*)
from table t1
join table t2
on t1.cola = t2.cola
and t1.colb = t2.colb
group by t1.cola, t1.colb
having count(*) > 1
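The two ideas from the question (enumerating column combinations, and a cheap check on a small portion of the data before the full-table check) can be sketched together. This is only an illustration under assumptions: it is written in Python with the standard `sqlite3` module standing in for SQL Server so that it runs as-is, the first `sample_rows` rows play the role of the 'page', and the `report` table, its columns, and the function names are all hypothetical. The grouping query needs no self-join; `GROUP BY ... HAVING COUNT(*) > 1` on the table alone detects duplicates.

```python
import sqlite3
from itertools import combinations


def has_duplicates(conn, source, cols):
    """True if some value combination in `cols` occurs more than once in `source`."""
    col_list = ", ".join(cols)
    # Identifiers are interpolated directly, so they must come from a trusted list.
    sql = (f"SELECT 1 FROM {source} "
           f"GROUP BY {col_list} HAVING COUNT(*) > 1 LIMIT 1")
    return conn.execute(sql).fetchone() is not None


def find_candidate_keys(conn, table, columns, sample_rows=100):
    """Return the minimal column sets that are unique over the whole table."""
    sample = f"(SELECT * FROM {table} LIMIT {sample_rows})"
    found = []
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            # A superset of a key already found is unique too, but not minimal.
            if any(set(key) <= set(cols) for key in found):
                continue
            # Cheap phase: duplicates in the sample rule the set out immediately.
            if has_duplicates(conn, sample, cols):
                continue
            # Expensive phase: confirm uniqueness on the full table.
            if not has_duplicates(conn, table, cols):
                found.append(cols)
    return found


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report (region TEXT, year INTEGER, amount INTEGER, note TEXT);
INSERT INTO report VALUES
    ('N', 2020, 10, 'x'),
    ('N', 2021, 10, 'x'),
    ('S', 2020, 20, 'x'),
    ('S', 2021, 30, 'x');
""")
keys = find_candidate_keys(conn, "report",
                           ["region", "year", "amount", "note"], sample_rows=2)
print(keys)
```

The sample can only reject a column set, never accept it: duplicates in a subset of rows imply duplicates in the table, but the reverse does not hold, so every surviving set still gets the full check.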

Related

Need to avoid repetition results in sql table

I have written a SQL command to get the entries that differ between two tables. But table3 is not static: it fills up one row at a time (it is populated by an RFID reader). So the difference table (table2) ends up with multiple copies of the same entry. Please help me avoid adding the same entries again.
INSERT INTO table2 (EPC)
SELECT PL.EPC
FROM priorityLevel PL
WHERE NOT EXISTS (
SELECT 1
FROM table3 T3
WHERE PL.EPC = T3.EPC
);
I expect each entry to be inserted only once.
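One common way to make such an insert safe to run repeatedly is to add a second NOT EXISTS that checks the target table itself. The sketch below is an assumption about what the asker needs, not a confirmed answer; it uses Python's `sqlite3` module as a stand-in for SQL Server so it can be run directly, with the table and column names taken from the question and made-up EPC values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE priorityLevel (EPC TEXT);
CREATE TABLE table3 (EPC TEXT);
CREATE TABLE table2 (EPC TEXT);
INSERT INTO priorityLevel VALUES ('A'), ('B'), ('C');
""")

# Same statement as in the question, plus a check against table2 itself.
INSERT_MISSING = """
INSERT INTO table2 (EPC)
SELECT PL.EPC
FROM priorityLevel PL
WHERE NOT EXISTS (SELECT 1 FROM table3 T3 WHERE T3.EPC = PL.EPC)
  AND NOT EXISTS (SELECT 1 FROM table2 T2 WHERE T2.EPC = PL.EPC)
"""

conn.execute(INSERT_MISSING)                       # first run: A, B and C are missing
conn.execute("INSERT INTO table3 VALUES ('A')")    # the reader reports one tag
conn.execute(INSERT_MISSING)                       # second run: nothing new to add

rows = [r[0] for r in conn.execute("SELECT EPC FROM table2 ORDER BY EPC")]
print(rows)
```

Two caveats: if priorityLevel itself can contain the same EPC twice, the SELECT would also need DISTINCT; and rows already in table2 are not removed when the tag later shows up in table3, which would take a separate DELETE.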

SQL View Optimization

I am trying to build a view that does basically two things: check whether a record in table 1 is in table 2, and check whether a link to another table is still there. It worked on a subset of the data, but when I tried to run the full query it timed out in the view designer.
The view worked fine until I added the check for whether the link to another table was present.
Initially it joined Table A to Table B and filtered out rows where A.ID wasn't present in the ID column of Table B.
I was then told that if the link between the person and the address table (stored in Table C) was removed, we would have no way of knowing other than to get a full extract of that table again and see which links are no longer present. I am trying to use that check to decide whether to display data in particular columns.
I am using the following structure close to 60 times to choose whether to show information in a column:
Column1 = case when exists (select LinkID from LinkTable C
where cast(C.LinkAddressID as varchar) = A.AddressID
and cast(C.LinkID as varchar) = A.ID)
then Column1
else NULL
end
There are about 1.6m records in Table A and just over 4m records in the Link table.
Is there a better way to write this query / view that would be more optimized?
Please let me know if more information is needed.
Select C.LinkID
From A
Left Join C On C.LinkAddressID = A.AddressID And C.LinkID = A.ID
This will give you C.LinkID if a row in C matches on both conditions, and NULL if no row in C satisfies both.
Having indexes / keys such as primary key on A.ID and foreign key relationships based on what is in the join clause will provide very good performance.
As Joe suggested, if you use the same AddressId and Id fields to match the two tables for all 60 columns, I believe you can use something like the following query:
SELECT
Column1 = CASE WHEN C.LinkID IS NULL THEN NULL ELSE A.Column1 END,
....
FROM A
Left Join LinkTable C
ON C.LinkAddressID = A.AddressID AND C.LinkID = A.ID
Casting a column inside a join or WHERE predicate generally prevents the optimizer from using an index on that column efficiently, so avoid data type casts in joins and WHERE clauses where possible.
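A small runnable sketch of the single-join approach, using Python's `sqlite3` module as a stand-in for SQL Server. The sample rows, the index name, and the assumption that the key columns already share a type (so no cast is needed) are illustrative and not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER, AddressID INTEGER, Column1 TEXT, Column2 TEXT);
CREATE TABLE LinkTable (LinkID INTEGER, LinkAddressID INTEGER);
-- Index on the join columns so the link lookup does not scan LinkTable.
CREATE INDEX ix_link ON LinkTable (LinkAddressID, LinkID);
INSERT INTO A VALUES (1, 10, 'a1', 'b1'), (2, 20, 'a2', 'b2');
INSERT INTO LinkTable VALUES (1, 10);
""")

# One LEFT JOIN answers "does the link exist?" once per row;
# every column then reuses that answer instead of running its own EXISTS.
rows = conn.execute("""
SELECT A.ID,
       CASE WHEN C.LinkID IS NULL THEN NULL ELSE A.Column1 END AS Column1,
       CASE WHEN C.LinkID IS NULL THEN NULL ELSE A.Column2 END AS Column2
FROM A
LEFT JOIN LinkTable C
       ON C.LinkAddressID = A.AddressID AND C.LinkID = A.ID
ORDER BY A.ID
""").fetchall()
print(rows)
```

Unlike EXISTS, the join returns one output row per matching link, so this form is only equivalent when LinkTable holds at most one row per (LinkAddressID, LinkID) pair.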

TSQL query to merge data from multiple tables that may or may not have matching rows?

For example, suppose we're conducting research where students can take up to 10 different tests, and each table in the database stores all the students' responses for one test. The tables are named after each test as: T1, T2, ... , T10. Suppose each table has a primary key column 'Username' that identifies each student. Students may or may not have completed each test, so there may or may not be a record in each table for each student.
What is the correct SQL Query to return all the test data from all tables, with one row per student (one row per username)? I want the simplest query possible that returns the correct results. I would also like to coalesce the Username fields into a single Username field in the final query.
To clarify, I understand that SQL has a major limitation in that it does not support a syntax to select all columns except one or more fields like "select *[^ExcludeColumn1][^ExcludeColumn2]". To avoid specifically naming all columns in the final query, it would be acceptable to leave all the Username columns there, as long as it includes a coalesced Username field at the beginning named something like RowID.
As for the overall query, one option would be to perform a union all on the username column of all ten tables, then select the distinct usernames across all tables, then perform a series of left joins against the list of distinct usernames on all 10 tables. That would result in a very straightforward query where each left join is performed on the same distinct set of usernames, but I want to avoid a separate up-front query for distinct usernames. (Although if that's the best option, let me know). It would look something like this:
select * from
(select distinct coalesce(t1.Username,t2.Username,...,t10.Username) as RowID from t1,t2,t3,t4,t5,t6,t7,t8,t9,t10) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
Although that is short and easy to write, it is incredibly inefficient and would take hours to run on test tables with 5000+ rows each, so with an adjustment, an equivalent version that runs in a few seconds is:
select * from (
select distinct Username as RowID from (
select Username from t1
union all
select Username from t2
union all
...
select Username from t10
) all_usernames) distinct_usernames
left join t1 on t1.Username = distinct_usernames.RowID
left join t2 on t2.Username = distinct_usernames.RowID
...
left join t10 on t10.Username = distinct_usernames.RowID
I think that what I have above might be the most efficient and correct query (takes only a couple seconds to run and returns correct result set), but I also thought perhaps it could be simplified with some kind of full join. The problem is that full joins get confusing with more than two tables, because without pre-determining the usernames, each subsequent table would have to match records against any of the preceding tables, resulting in a query where each additional table has "[previous table count] + 1" conditions on matching the username.
Assuming that Username is unique in each table, your second query would be the way I would try first, with the slight modifications of removing distinct and simply using union (which implies distinct) rather than union all:
select *
from (
select Username from t1
union
select Username from t2
union
-- ...
select Username from t10
) distinct_usernames
left join t1 on t1.Username = distinct_usernames.Username
left join t2 on t2.Username = distinct_usernames.Username
-- ...
left join t10 on t10.Username = distinct_usernames.Username
From there I would make sure that Username is indexed, possibly even using it as the clustered index. I've also had optimization luck in the past by implementing your distinct_usernames as a temp table (possibly indexed, or an indexed view) at the beginning of the proc, but only testing would determine if that were worthwhile.
A full outer join would require a bunch of or conditions or coalesce arguments, though it could be worth a try on just a few tables to see if the performance is there. I can't try to out-guess what your query engine will like best.
Also, getting just the column names that you want could be done with a query to sys.columns or information_schema.columns and using dynamic SQL to build your query as a string and then executing that.
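The UNION-plus-LEFT-JOIN query and the idea of building the column list from metadata can be combined in a short sketch. It uses Python's `sqlite3` module so it runs as-is; `PRAGMA table_info` plays the role that sys.columns or information_schema.columns would play on SQL Server, and the three small tables and their `q1`..`q3` columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (Username TEXT PRIMARY KEY, q1 INTEGER);
CREATE TABLE t2 (Username TEXT PRIMARY KEY, q2 INTEGER);
CREATE TABLE t3 (Username TEXT PRIMARY KEY, q3 INTEGER);
INSERT INTO t1 VALUES ('ann', 1), ('bob', 2);
INSERT INTO t2 VALUES ('bob', 3), ('cy', 4);
INSERT INTO t3 VALUES ('ann', 5);
""")

tables = ["t1", "t2", "t3"]


def non_key_columns(table):
    """Column names of `table` other than Username, read from the catalog."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")
            if row[1] != "Username"]


# Build the select list so that Username appears once, as RowID.
select_list = ["u.Username AS RowID"]
for t in tables:
    select_list += [f"{t}.{c}" for c in non_key_columns(t)]

# UNION (not UNION ALL) already removes duplicate usernames.
usernames = "\n    UNION\n    ".join(f"SELECT Username FROM {t}" for t in tables)
joins = "\n".join(f"LEFT JOIN {t} ON {t}.Username = u.Username" for t in tables)
query = (f"SELECT {', '.join(select_list)}\n"
         f"FROM (\n    {usernames}\n) u\n{joins}\nORDER BY RowID")

rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Generating the statement as a string is what makes it possible to leave out the repeated Username columns without naming every other column by hand.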

SQL WHERE NOT EXISTS (skip duplicates)

Hello, I'm struggling to get the query below right. What I want is to return rows with unique names and surnames. What I get is all rows, including the duplicates.
This is my SQL:
DECLARE @tmp AS TABLE (Name VARCHAR(100), Surname VARCHAR(100))
INSERT INTO @tmp
SELECT CustomerName,CustomerSurname FROM Customers
WHERE
NOT EXISTS
(SELECT Name,Surname
FROM @tmp
WHERE Name=CustomerName
AND Surname=CustomerSurname
GROUP BY Name,Surname )
Please can someone point me in the right direction here.
//Desperate (I tried without GROUP BY as well but get same result)
DISTINCT would do the trick.
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
If you only want the records that really don't have duplicates (as opposed to getting duplicates represented as a single record) you could use GROUP BY and HAVING:
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1
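The difference between the two queries is easiest to see on a tiny data set. The sketch below runs both with Python's `sqlite3` module as a stand-in for SQL Server; the three sample customers and the `Id` column are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (Id INTEGER PRIMARY KEY,
                        CustomerName TEXT, CustomerSurname TEXT);
INSERT INTO Customers VALUES (1, 'Ann', 'Lee'), (2, 'Ann', 'Lee'), (3, 'Bob', 'Ray');
""")

# DISTINCT collapses each group of duplicates into a single row.
collapsed = conn.execute("""
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
ORDER BY CustomerName
""").fetchall()

# HAVING COUNT(*) = 1 keeps only the names that never had a duplicate.
singles = conn.execute("""
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1
ORDER BY CustomerName
""").fetchall()

print(collapsed)
print(singles)
```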
At first I thought that @David's answer was what you want. But rereading your comments, perhaps you want all combinations of names and surnames:
SELECT n.CustomerName, s.CustomerSurname
FROM
( SELECT DISTINCT CustomerName
FROM Customers
) AS n
CROSS JOIN
( SELECT DISTINCT CustomerSurname
FROM Customers
) AS s ;
Are you doing that while your @tmp table is still empty?
If so: your entire "select" is fully evaluated before the "insert" statement runs; it doesn't do "run the query and get one row, insert the row, run the query and get another row, insert the row, etc."
If you want to insert only customers that have no duplicates, use that same "Customers" table in your NOT EXISTS clause:
SELECT c.CustomerName,c.CustomerSurname FROM Customers c
WHERE
NOT EXISTS
(SELECT 1
FROM Customers c1
WHERE c.CustomerName = c1.CustomerName
AND c.CustomerSurname = c1.CustomerSurname
AND c.Id <> c1.Id)
If you want to insert a unique set of customers, use "distinct"
Typically, if you're doing a WHERE NOT EXISTS or WHERE EXISTS, or WHERE NOT IN subquery,
you should use what is called a "correlated subquery", as in ypercube's answer above, where table aliases are used for both inside and outside tables (where inside table is joined to outside table). ypercube gave a good example.
And often, NOT EXISTS is preferred over NOT IN (unless the WHERE NOT IN is selecting from a totally unrelated table that you can't join on.)
Sometimes, if you're tempted to do a WHERE EXISTS (SELECT from a small table with no duplicate values in the column), you could do the same thing by joining the main query with that table on the column you want in the EXISTS. That is not always the best or safest solution: it might make the query slower if there are many rows in that table, and it could produce many duplicate rows if there are duplicate values for that column in the joined table, in which case you'd have to add DISTINCT to the main query, which causes it to sort the data on all columns. Not efficient at all.
And, similarly, the WHERE NOT IN or NOT EXISTS correlated subqueries can be accomplished (often with the same execution plan) if you LEFT OUTER JOIN the table you were going to subquery and add a WHERE <joined-table column> IS NULL.
You have to be careful using that, but you don't need a DISTINCT. Frankly, I prefer to use the WHERE NOT IN subqueries or NOT EXISTS correlated subqueries, because the syntax makes the intention clear and it's hard to go wrong.
And you do not need a DISTINCT in the SELECT inside such subqueries (correlated or not). It would be a waste of processing (and for WHERE EXISTS or WHERE IN subqueries, the SQL optimizer would ignore it anyway and just use the first value that matched for each row in the outer query). (Hope that makes sense.)
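The three anti-join spellings discussed above can be compared side by side. This sketch uses Python's `sqlite3` module as a stand-in for SQL Server, with two hypothetical tables, `outer_t` and `inner_t`. It also shows one case where the forms are not interchangeable: when the subquery column contains a NULL, NOT IN returns no rows at all, which is standard SQL three-valued logic rather than an engine quirk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outer_t (id INTEGER);
CREATE TABLE inner_t (ref INTEGER);
INSERT INTO outer_t VALUES (1), (2), (3);
INSERT INTO inner_t VALUES (1), (NULL);
""")


def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Correlated subquery: the inner table is tied to the outer row via o.id.
not_exists = ids("""
SELECT o.id FROM outer_t o
WHERE NOT EXISTS (SELECT 1 FROM inner_t i WHERE i.ref = o.id)
ORDER BY o.id
""")

# Same rows via LEFT JOIN plus IS NULL on the joined table's column.
left_join = ids("""
SELECT o.id FROM outer_t o
LEFT JOIN inner_t i ON i.ref = o.id
WHERE i.ref IS NULL
ORDER BY o.id
""")

# NOT IN: "2 NOT IN (1, NULL)" evaluates to unknown, so the row is dropped.
not_in = ids("""
SELECT o.id FROM outer_t o
WHERE o.id NOT IN (SELECT ref FROM inner_t)
ORDER BY o.id
""")

print(not_exists, left_join, not_in)
```

If the column used in NOT IN is declared NOT NULL, or the subquery filters NULLs out, the three forms return the same rows.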

How can I "subtract" one table from another?

I have a master table A, with ~9 million rows. Another table B (same structure) has ~28K rows from table A. What would be the best way to remove all contents of B from table A?
The combination of all columns (~10) is unique. There is nothing else in the form of a unique key.
If you have sufficient rights you can create a new table and rename that one to A. To create the new table you can use the following script:
CREATE TABLE TEMP_A AS
SELECT *
FROM A
MINUS
SELECT *
FROM B
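This should perform pretty well. Note that CREATE TABLE ... AS SELECT and MINUS are Oracle syntax; on SQL Server the equivalents are SELECT ... INTO and EXCEPT.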
DELETE FROM TableA WHERE ID IN(SELECT ID FROM TableB)
Should work. Might take a while though.
One way is to just list out all the columns:
-- a and b stand for the two tables; continue the AND list for every column
delete from a
where exists (select 1 from b where b.Col1 = a.Col1
AND b.Col2 = a.Col2
AND b.Col3 = a.Col3
AND b.Col4 = a.Col4)
Delete t2
from t1
inner join t2
on t1.col1 = t2.col1
and t1.col2 = t2.col2
and t1.col3 = t2.col3
and t1.col4 = t2.col4
and t1.col5 = t2.col5
and t1.col6 = t2.col6
and t1.col7 = t2.col7
and t1.col8 = t2.col8
and t1.col9 = t2.col9
and t1.col10 = t2.col10
This is likely to be very slow as you would have to have every col indexed which is highly unlikely in an environment when a table this size has no primary key, so do it during off peak. What possessed you to have a table with 9 million records and no primary key?
If this is something you'll have to do on a regular basis, the first choice should be to try to improve the database design (looking for primary keys, trying to get the "join" condition to be on as few columns as possible).
If that is not possible, the distant second option is to figure out the "selectivity" of each of the columns (i.e. how many different values each column has; 'name' would be more selective than 'address country', which in turn would be more selective than 'male/female').
The general type of statement I'd suggest would be like this:
Delete from tableA
where exists (select * from tableB
where tableA.colx1 = tableB.colx1
and tableA.colx2 = tableB.colx2
-- ... and so on for the columns in between ...
and tableA.colx10 = tableB.colx10)
The idea is to list the columns in order of selectivity and build an index on colx1, colx2 etc. on tableB. The exact number of columns to include in the index on tableB would come out of some trial and measurement. (Weigh the time for building the index on tableB against the improved time of the delete statement.)
If this is just a one time operation, I'd just pick one of the slow methods outlined above. It's probably not worth the effort to think too much about this when you can just start a statement before going home ...
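The selectivity idea can be sketched end to end: count the distinct values per column, order the columns from most to least selective, index tableB in that order, and delete with EXISTS. The example uses Python's `sqlite3` module as a stand-in for SQL Server; the three columns, the sample rows, and the index name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (name TEXT, country TEXT, gender TEXT);
CREATE TABLE tableB (name TEXT, country TEXT, gender TEXT);
INSERT INTO tableA VALUES
    ('ann', 'US', 'F'), ('bob', 'US', 'M'), ('cy', 'UK', 'M'),
    ('dee', 'FR', 'F'), ('eve', 'US', 'F');
INSERT INTO tableB VALUES ('bob', 'US', 'M'), ('dee', 'FR', 'F');
""")

columns = ["gender", "country", "name"]


def distinct_count(col):
    """Number of different values in a column of tableA (its selectivity)."""
    return conn.execute(f"SELECT COUNT(DISTINCT {col}) FROM tableA").fetchone()[0]


# Most selective column first: it narrows the index lookup the fastest.
by_selectivity = sorted(columns, key=distinct_count, reverse=True)

conn.execute(f"CREATE INDEX ix_b ON tableB ({', '.join(by_selectivity)})")

predicate = " AND ".join(f"tableB.{c} = tableA.{c}" for c in by_selectivity)
conn.execute(f"DELETE FROM tableA WHERE EXISTS "
             f"(SELECT 1 FROM tableB WHERE {predicate})")

remaining = [r[0] for r in conn.execute("SELECT name FROM tableA ORDER BY name")]
print(by_selectivity, remaining)
```

One thing to check before relying on this: `=` never matches NULL, so rows that contain NULL in any compared column would survive the delete even if an identical row exists in tableB.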
Is there a key value (or values) that can be used?
something like
DELETE a
FROM tableA a
INNER JOIN tableB b
on b.id = a.id
