how to get multiple sets of distinct values - sql-server

This is not about distinct combinations of values (Select distinct col1, col2 from table)
I have a table newly loaded from a csv file.
Some columns are linked to foreign key dimensions but the values in a given column may not exist in the reference tables.
My desire is to find all the values in each column that do not exist in the reference tables, but in such a way as to minimize the number of table scans on our source table.
My current approach consumes the output of a bunch of queries like the following:
SELECT DISTINCT col2 FROM table WHERE col2 NOT IN (SELECT val FROM DimCol2)
SELECT DISTINCT col3 FROM table WHERE col3 NOT IN (SELECT val FROM DimCol3)
however, for N columns, this results in N table scans.
Table is up to 10M rows and columns range in cardinality from 5 through to 5M, but almost all values are already present in the dim tables (>99%).
DimColN ranges in size from 5 values to 50M values, and is well indexed.
The csv is loaded into the table via SSIS, so adding pre-processing inside SSIS is possible, but I would have to avoid a SQL query for each row.
The ssis server does not have enough spare ram to cache all the dim tables.

What about using a LEFT JOIN and checking where the result of the join is NULL, meaning the value doesn't exist in DimCol2?
SELECT DISTINCT a.Col2
FROM table a
LEFT JOIN DimCol2 b ON a.Col2 = b.val
WHERE b.val IS NULL
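That still costs one scan per column, though. Below is a hedged sketch that checks several columns in a single pass over the source table, using hypothetical names (stage for the staging table, DimCol2/DimCol3 for the dims). Note that NOT IN silently returns no rows at all if the dim column contains a NULL, so NOT EXISTS is the safer probe:
SELECT DISTINCT x.colname, x.colval
FROM stage s
CROSS APPLY (VALUES (N'col2', s.col2),
                    (N'col3', s.col3)) x(colname, colval)  -- unpivot once; cast to a common type if the columns differ
WHERE (x.colname = N'col2'
       AND NOT EXISTS (SELECT 1 FROM DimCol2 d WHERE d.val = x.colval))
   OR (x.colname = N'col3'
       AND NOT EXISTS (SELECT 1 FROM DimCol3 d WHERE d.val = x.colval));
This reads the staging table once and probes each dim through its index, which suits the ">99% already present" case.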

Related

SQL Query to return rows with the most columns populated

Azure SQL Server 2019.
We have a table Table1 with over 100 columns of differing types of nvarchar data, all of which are allowed NULL values, and where there could be anywhere from 1 to 100 columns populated in a given record. I need to formulate a query that returns the rows ranked by how many columns have values in them, in descending order.
I started going down a road of using DATALENGTH and having to type out the name of every single column, but I can only imagine there has to be a more efficient way. Assuming the column names are column1, column2, column3 etc, how would I accomplish this?
How about a lateral join that unpivots the columns to rows? This requires enumerating the columns just once, like so:
select t.*, c.cnt
from mytable t
cross apply (
    select count(*) cnt
    from (values (t.column1), (t.column2), (t.column3)) x(col)
    where col is not null
) c
order by c.cnt desc
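With 100+ columns, typing out the VALUES list is still tedious. A sketch, assuming the table is dbo.mytable (hypothetical name), that builds the list from sys.columns and runs the same query dynamically; STRING_AGG requires SQL Server 2017 or later, which Azure SQL / SQL Server 2019 satisfies:
DECLARE @cols nvarchar(max) =
    (SELECT STRING_AGG(CAST('(t.' + QUOTENAME(name) + ')' AS nvarchar(max)), ', ')
     FROM sys.columns
     WHERE object_id = OBJECT_ID(N'dbo.mytable'));

DECLARE @sql nvarchar(max) = N'
select t.*, c.cnt
from dbo.mytable t
cross apply (
    select count(*) cnt
    from (values ' + @cols + N') x(col)
    where col is not null
) c
order by c.cnt desc;';

EXEC sys.sp_executesql @sql;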

String or binary data would be truncated error in SQL server. How to know the column name throwing this error

I have an INSERT query that loads data using a SELECT query with certain joins between tables.
While running that query, it fails with the error "String or binary data would be truncated".
I am trying to insert thousands of rows across multiple columns into that table,
so it is not possible to eyeball all the data and see which value is throwing this error.
Is there any specific way to identify which column is throwing this error, or which specific record is not getting inserted properly and causing it?
I found one article on this:
RareSQL
But that covers inserting values one row at a time.
I am inserting multiple rows at the same time using SELECT statements.
E.g.,
INSERT INTO TABLE1 (COLUMN1, COLUMN2, ...) SELECT COLUMN1, COLUMN2, ... FROM TABLE2 JOIN TABLE3
Also, in my case, I am having multiple inserts and update statements and even not sure which statement is throwing this error.
You can do a selection like this:
select TABLE2.ID, TABLE3.ID, TABLE1.COLUMN1, TABLE1.COLUMN2, ...
FROM TABLE2
JOIN TABLE3
ON TABLE2.JOINCOLUMN1 = TABLE3.JOINCOLUMN2
LEFT JOIN TABLE1
ON TABLE1.COLUMN1 = TABLE2.COLUMN1 AND TABLE1.COLUMN2 = TABLE2.COLUMN2, ...
WHERE TABLE1.ID IS NULL
The first join reproduces the selection you have been using for the insert and the second join is a left join, which will yield null values for TABLE1 if a row having the exact column values you wanted to insert does not exist. You can apply this logic to your other queries, which were not given in the question.
You might just have to do it the hard way. To make it a little simpler, you can do this:
Temporarily remove the insert command from the query, so you are getting a result set out of it. You might need to give some of the columns aliases if they don't come with one. Then wrap that select query as a subquery, and test likely columns (nvarchars, etc.) like this:
Select top 5 len(Col1), *
from (Select col1, col2, ... your query (without insert) here) A
Order by 1 desc
This will sort the rows with the largest values in the specified column first and just return the rows with the top 5 values - enough to see if you've got a big problem or just one or two rows with an issue. You can quickly change which column you're checking simply by changing the column name in the len(Col1) part of the first line.
If the subquery takes a long time to run, create a temp table with the same columns but with the string sizes large (like varchar(max) or something) so there are no errors, and then you can do the insert just once to that table, and run your tests on that table instead of running the subquery a lot
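A minimal sketch of that workaround, with hypothetical column and join names standing in for the real query:
-- Oversized staging table so the SELECT materializes once without truncating:
CREATE TABLE #wide (col1 nvarchar(max), col2 nvarchar(max));

INSERT INTO #wide (col1, col2)
SELECT t2.col1, t2.col2              -- your original SELECT, minus the INSERT
FROM TABLE2 t2
JOIN TABLE3 t3 ON t2.id = t3.id;     -- hypothetical join, as in the question

-- Probe suspect columns cheaply; change the column inside LEN() as needed:
SELECT TOP (5) LEN(col1) AS len1, *
FROM #wide
ORDER BY len1 DESC;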
From this answer, you can use a temp table and compare it with the target table.
For example, this:
Insert into dbo.MyTable (columns)
Select columns
from MyDataSource ;
Become this
Select columns
into #T
from MyDataSource;
select *
from tempdb.sys.columns as TempCols
full outer join MyDb.sys.columns as RealCols
on TempCols.name = RealCols.name
and TempCols.object_id = Object_ID(N'tempdb..#T')
and RealCols.object_id = Object_ID(N'MyDb.dbo.MyTable')
where TempCols.name is null -- no match for real target name
or RealCols.name is null -- no match for temp target name
or RealCols.system_type_id != TempCols.system_type_id
or RealCols.max_length < TempCols.max_length ;
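Also worth knowing: newer builds can name the offending column for you. SQL Server 2019 reports the detailed message ("String or binary data would be truncated in table '%', column '%'. Truncated value: '%'") by default, and on SQL Server 2016 SP2 CU6 / 2017 CU12 and later you can opt in, as sketched below:
-- SQL Server 2019: controlled per database (ON by default):
ALTER DATABASE SCOPED CONFIGURATION SET VERBOSE_TRUNCATION_WARNINGS = ON;

-- SQL Server 2016 SP2 CU6 / 2017 CU12 and later: enable trace flag 460:
DBCC TRACEON (460);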

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query on. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B), containing the record id from the Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
FROM TableB b
WHERE col1 IS NOT NULL
AND col2 IS NOT NULL) -- if the "empty" marker is '' rather than NULL, <> '' might be quicker
SELECT a.record_id, a.identifyColumnsNeededExplicitly
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows / general content of your rows / ...). Any analysis should be done case by case, and what's true for two tables may not be true for two others.
Theoretically:
Without any index, the first one should be the best, since it will optimize the operations into one table scan on TableB, two constant scans on TableB, and one table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, OR an index on both columns, the second one should be faster, since it will do an index scan and two constant scans.
In rare cases it could be the third one, depending on TableB's statistics, but it won't be far from number 2, since number 3 will scan all of TableB.
In even rarer cases, none of the three.
What I'm trying to say is: "Since we have neither your tables nor your rows, open SQL Server Management Studio, turn the statistics on, and try it yourself."
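For the "try it yourself" part, a minimal way to compare the candidates in SSMS:
SET STATISTICS IO ON;    -- reports logical reads per table
SET STATISTICS TIME ON;  -- reports CPU and elapsed time

-- run each candidate query here and compare the Messages tab output;
-- "Include Actual Execution Plan" (Ctrl+M) shows the operators chosen

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;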

2 nvarchar fields are not matching though the data is same?

I want to join 2 tables using an Inner Join on 2 columns, both are of (nvarchar, null) type. The 2 columns have the same data in them, but the join condition is failing. I think it is due to the spaces contained in the column values.
I have also tried LTRIM and RTRIM.
My query:
select
T1.Name1, T2.Name2
from
Table1 T1
Inner Join
Table2 T2 on T1.Name1 = T2.Name2
I have also tried like this:
on LTRIM(RTRIM(T1.Name1)) = LTRIM(RTRIM(T2.Name2))
My data:
Table1 Table2
------ ------
Name1(Column) Name2(Column)
----- ------
Employee Data Employee Data
Customer Data Customer Data
When I check my data in the 2 tables with
select T1.Name1, len(T1.Name1) as Length1, Datalength(T1.Name1) as DataLength1 from Table1 T1
select T2.Name2, len(T2.Name2) as Length2, Datalength(T2.Name2) as DataLength2 from Table2 T2
the result shows different Length and DataLength values for the 2 tables; they are not the same.
I can't change the original data in the 2 tables. How can I fix this issue?
Thank You
Joins do not have special rules for equality. The equality operator always works the same way. So if a = b then the join on a = b would work. Therefore, a <> b.
Check the contents of those fields. They will not be the same although you think they are:
select convert(varbinary(max), myCol) from T
Unicode has invisible characters (that only ever seem to cause trouble).
declare #t table (name varchar(20))
insert into #t(name)values ('Employee Data'),('Customer Data')
declare #tt table (name varchar(20))
insert into #tt(name)values ('EmployeeData'),('CustomerData')
select t.name,tt.name from #t t
INNER JOIN #tt tt
ON RTRIM(LTRIM(REPLACE(t.name,' ',''))) = RTRIM(LTRIM(REPLACE(tt.name,' ','')))
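If the varbinary dump shows 0x00A0 (a non-breaking space, common in data pasted from web pages or Excel), note that LTRIM/RTRIM only strip the regular space character. A hedged sketch that normalizes non-breaking spaces before comparing:
select T1.Name1, T2.Name2
from Table1 T1
inner join Table2 T2
    on LTRIM(RTRIM(REPLACE(T1.Name1, NCHAR(160), N' ')))   -- NCHAR(160) = non-breaking space
     = LTRIM(RTRIM(REPLACE(T2.Name2, NCHAR(160), N' ')))
Other invisible characters (tabs, zero-width spaces) would need their own REPLACE calls.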
I would follow this schema, sketched in code below the list:
Create a new table to store all the possible names
Add the needed keys and indexes
Populate it with the existing names
Add columns to your existing tables to store the index of the name
Create the corresponding foreign keys
Populate the new columns with the correct indexes
Create a procedure that inserts into the names table only when the value does not already exist
Perform the join using the new indexes
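A rough sketch of the first steps, with hypothetical object names and assuming the values are normalized (trimmed) on the way in:
CREATE TABLE dbo.Names (
    NameId int IDENTITY(1,1) PRIMARY KEY,
    Name   nvarchar(200) NOT NULL UNIQUE   -- one canonical copy per name
);

-- Populate with normalized values from both tables (UNION de-duplicates):
INSERT INTO dbo.Names (Name)
SELECT LTRIM(RTRIM(Name1)) FROM Table1
UNION
SELECT LTRIM(RTRIM(Name2)) FROM Table2;

-- Add the lookup column and foreign key, then backfill it:
ALTER TABLE Table1 ADD NameId int NULL
    CONSTRAINT FK_Table1_Names REFERENCES dbo.Names (NameId);
GO  -- new column must exist before the UPDATE below compiles

UPDATE T1
SET NameId = n.NameId
FROM Table1 T1
JOIN dbo.Names n ON n.Name = LTRIM(RTRIM(T1.Name1));
-- ...repeat for Table2, then join the two tables on the integer NameId.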

Remove duplicate from a staging file

I have a staging table which contains a whole series of rows of data which were taken from a data file.
Each row details a change to a row in a remote system; the rows are effectively snapshots of the source row taken after every change. Each row contains metadata timestamps for creation and update.
I am now trying to build an update table from these data files, which contains all of the updates. I need a way to remove rows with duplicate keys, keeping only the row with the latest "update" timestamp.
I am aware I can use the SSIS "sort" transform to remove duplicates by sorting on the key field and telling it to remove duplicates, but how do I ensure that the row it keeps is the one with the latest time stamp?
This will remove rows that match on Col1, Col2, etc. and have an UpdateDate that is NOT the most recent:
DELETE D
FROM MyTable AS D
JOIN MyTable AS T
ON T.Col1 = D.Col1
AND T.Col2 = D.Col2
...
AND T.UpdateDate > D.UpdateDate
If Col1 and Col2 need to be considered "matching" if they are both NULL then you would need to use:
ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
...
Edit: If you need to make a Case Sensitive test on a Case INsensitive database then on VARCHAR and TEXT columns use:
ON (T.Col1 = D.Col1 COLLATE Latin1_General_BIN
OR (T.Col1 IS NULL AND D.Col1 IS NULL))
...
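An alternative sketch using ROW_NUMBER, which sidesteps the NULL-matching problem because PARTITION BY treats NULLs as equal (key and timestamp column names assumed from above):
WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2          -- the key columns
                              ORDER BY UpdateDate DESC) AS rn  -- newest first
    FROM MyTable
)
DELETE FROM Ranked   -- deleting through the CTE deletes from MyTable
WHERE rn > 1;        -- keeps only the most recent row per key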
You can use the Sort Transform in SSIS to sort your data set by more than one column. Simply sort by your primary key (or ID field) followed by your timestamp column in descending order.
See the following article for more details on working with the Sort transformation:
http://msdn.microsoft.com/en-us/library/ms140182.aspx
Make sense?
Cheers, John
Does it make sense to just ignore the duplicates when moving from staging to final table?
You have to do this anyway, so why not issue one query against the staging table rather than two?
INSERT final
(key, col1, col2)
SELECT
s.key, s.col1, s.col2
FROM
staging s
JOIN
(SELECT key, MAX(datetimestamp) maxdt FROM staging GROUP BY key) ms
ON s.key = ms.key AND s.datetimestamp = ms.maxdt
