Fastest Way to Match Column Values of Tables In SQL Server - sql-server

I have two tables, Table1 and Table2.
Suppose Table1 has the following columns: T1c1, T1c2, T1c3
and Table2 has the following columns: T2c1, T2c2, T2c3
I need to add the values of Table2.T2c3 to Table1.T1c3 for rows that match on the other two columns: Table1.T1c1 with Table2.T2c1 and Table1.T1c2 with Table2.T2c2. If one of those columns is NULL in one table or in both, the match should fall back to the remaining column alone, for example Table1.T1c1 with Table2.T2c1 only.
The problem is that my tables are very large: several hundred million rows. I need the fastest way to match them and fill in the Table1.T1c3 values.

I think you could abuse COALESCE() to do this in the ON clause of your join:
SELECT table2.c3 + table1.c3
FROM table1
INNER JOIN table2
ON COALESCE(table1.c1, table2.c1) = COALESCE(table2.c1, table1.c1)
AND COALESCE(table1.c2, table2.c2) = COALESCE(table2.c2, table1.c2);
That feels like it might be speedier than dropping CASE all over the place or hitting this with a bunch of OR conditions, but it might be six of one, half a dozen of the other...
This assumes that c1 and c2 are never both NULL in the same row of either table. Otherwise you may end up with some funky cross-join shenanigans.
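Since the question asks to fill Table1.T1c3 rather than just select the sum, the same join could be turned into an UPDATE. This is only a sketch: it assumes "add" means arithmetic addition (as in the SELECT above) and uses the column names from the question.

```sql
-- Sketch only: NULL-tolerant match on (c1, c2), written as an UPDATE.
-- Assumes each Table1 row matches at most one Table2 row; if several
-- rows match, SQL Server applies an arbitrary one of them.
UPDATE t1
SET    t1.T1c3 = t1.T1c3 + t2.T2c3   -- yields NULL if either operand is NULL
FROM   Table1 AS t1
INNER JOIN Table2 AS t2
    ON  COALESCE(t1.T1c1, t2.T2c1) = COALESCE(t2.T2c1, t1.T1c1)
    AND COALESCE(t1.T1c2, t2.T2c2) = COALESCE(t2.T2c2, t1.T1c2);
```

With this predicate, a column that is NULL on both sides compares NULL to NULL and the pair does not match. On tables of this size it is worth checking the execution plan first, because a join predicate built from expressions over both tables may keep the optimizer from using the indexes on those columns.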

Related

How can I check and remove duplicate rows?

I have a problem with a fairly big table that has NULL values in 3 columns: a datetime2 column and 2 float columns.
A nice simple query from a similar question finds only 2 rows, where datetime2 is NULL, and nothing else (the same goes for a lot of the other suggestions):
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, allRemainingCols
FROM MyTable
GROUP BY allRemainingCols
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
It seems to work when the datetime2 column has no NULLs?
There is a manual workaround, but is there any way to write a query or procedure for this using T-SQL only?
SELECT id,remainingColumns
FROM table
order BY remainingColumns
Then compare all columns in Excel (15 in my case; I placed =ROW() in the first column as a check, put the formula next to the last column, and added an auto filter for TRUE): =AND(B1=B2;C1=C2;D1=D2;E1=E2;F1=F2;G1=G2;H1=H2;I1=I2;J1=J2;K1=K2;L1=L2;M1=M2;N1=N2;O1=O2;P1=P2)
Or compare 3 rows like this and select all non-unique rows:
=OR(
AND(B1=B2;C1=C2;D1=D2;E1=E2;F1=F2;G1=G2;H1=H2;I1=I2;J1=J2;K1=K2;L1=L2;M1=M2;N1=N2;O1=O2;P1=P2);
AND(B2=B3;C2=C3;D2=D3;E2=E3;F2=F3;G2=G3;H2=H3;I2=I3;J2=J3;K2=K3;L2=L3;M2=M3;N2=N3;O2=O3;P2=P3)
)
It took quite a lot of work to find the answer for my particular data...
Most of the float values were slightly different.
The differences are hard to spot, but a simple CAST(column as binary) can show these invisible differences.
For example, the displayed value 96,6666666666667 can be 0x0000000000000000000000000000000000000000000040582AAAAAAAAAAD in one row and 0x0000000000000000000000000000000000000000000040582AAAAAAAAAAB in another, etc.
And the visible 96.6666666666667 can come back as something different again:
0x0000000000000000000000000000000000000F0D0001AB6A489F2D6F0300
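Given that finding, one T-SQL-only approach is to treat floats as equal when they agree after rounding, and to keep one row per group. The sketch below is an illustration, not a tested fix for this table: DateCol, FloatCol1, FloatCol2 and the rounding precision of 6 are assumptions, and the remaining columns would have to be added to the PARTITION BY list.

```sql
-- Sketch only: remove duplicates while ignoring tiny float differences.
-- DateCol, FloatCol1 and FloatCol2 are hypothetical column names.
WITH Ranked AS (
    SELECT RowId,
           ROW_NUMBER() OVER (
               PARTITION BY DateCol,              -- NULLs fall into one group
                            ROUND(FloatCol1, 6),  -- precision is an assumption
                            ROUND(FloatCol2, 6)
               ORDER BY RowId                     -- keep the lowest RowId
           ) AS rn
    FROM MyTable
)
DELETE FROM Ranked
WHERE rn > 1;
```

Rounding is a blunt tool: two values that sit on either side of a rounding boundary still land in different groups, so it is worth running the CTE as a SELECT first and inspecting which rows would be removed.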

How to put together the results of functions of several tables in one query?

I have two tables, for example ONE and TWO. My real tables are bigger and have different structures, but some of their fields coincide.
How can I get the table QUERY?
I can do it for just one table with GROUP BY and SUM, but how can I add the last column in QUERY? Is it possible?
SELECT Date, SUM(Apples), SUM(Oranges) FROM One
GROUP BY Date
SELECT Date, one.place, SUM(Apples), SUM(Oranges), SUM(Grapes)
FROM One left join Two on one.place = two.place and one.[Date] = two.[Date]
GROUP BY Date
SELECT one.Date, one.place, SUM(Apples), SUM(Oranges), SUM(Grapes)
FROM One left join Two on one.place = two.place and one.date = two.date
GROUP BY one.date, one.place
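This returns every place in the first table with the correct totals, provided each place and date combination matches at most one row in the other table; if it matches several, the join repeats rows and inflates the sums.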
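If a place and date can appear on several rows in either table, a safer variant is to aggregate each table on its own and join the totals afterwards. This sketch assumes Apples and Oranges live in One and Grapes lives in Two, which the question implies but does not state.

```sql
-- Sketch only: aggregate each table separately, then join the results,
-- so duplicate rows in one table cannot multiply sums from the other.
SELECT o.[Date], o.place, o.Apples, o.Oranges, t.Grapes
FROM (
    SELECT [Date], place,
           SUM(Apples)  AS Apples,
           SUM(Oranges) AS Oranges
    FROM One
    GROUP BY [Date], place
) AS o
LEFT JOIN (
    SELECT [Date], place,
           SUM(Grapes) AS Grapes
    FROM Two
    GROUP BY [Date], place
) AS t
    ON  t.place  = o.place
    AND t.[Date] = o.[Date];
```

Grapes comes back as NULL for combinations that exist only in One; wrapping it in COALESCE(t.Grapes, 0) would turn that into a zero if that reads better in the report.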

Reference main selection table inside joined selection table

I have two tables with similar data, and I want to find the closest matches for comparison. Here's what I was trying to do:
select a.field1 as a1, b.field1 as b1, a.field2 as a2, b.field2 as b2
from foo a
left join (
select top 1 tmp.field1, tmp.field2
from foo2 tmp
-- The closest match will match the most fields. Add up these.
order by case when tmp.field1 = a.field1 then 1 else 0 end
+ case when tmp.field2 = a.field2 then 1 else 0 end
desc) b on 1 = 1
I can't reference the main selection table in the join though.
Perhaps I'm going about it all wrong. The actual goal is that I was given a spreadsheet of data and told to update a database. The spreadsheet has no PK and is missing many fields that the database has. Also, the database has foreign keys and child data all over. So I don't want to delete/insert. Instead I want to compare values and update wherever possible. So I created two temporary tables and pulled the database records into one and the spreadsheet records into another. Now I'm wanting to work with those two tables to update records, and finally delete/insert where no update is available.
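On the narrower point of referencing the outer table from inside the derived table: a plain join cannot do that, but the APPLY operator can. A minimal sketch using the same tables and fields as the attempt above:

```sql
-- Sketch only: OUTER APPLY evaluates the subquery once per row of foo,
-- so the subquery may reference columns of the outer alias "a".
SELECT a.field1 AS a1, b.field1 AS b1,
       a.field2 AS a2, b.field2 AS b2
FROM foo AS a
OUTER APPLY (
    SELECT TOP (1) tmp.field1, tmp.field2
    FROM foo2 AS tmp
    -- The closest match is the row that agrees on the most fields.
    ORDER BY CASE WHEN tmp.field1 = a.field1 THEN 1 ELSE 0 END
           + CASE WHEN tmp.field2 = a.field2 THEN 1 ELSE 0 END DESC
) AS b;
```

Two caveats: ties are broken arbitrarily, and a row from foo2 is returned even when no field matches at all, so a real version would probably filter on a minimum score.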
Have you looked up the MERGE statement? It does what you want, although the syntax is a bit tricky.
Article here with half decent examples:
http://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
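As a rough idea of the shape of such a statement, here is a sketch. The table names dbo.DbRecords and dbo.SheetRecords are hypothetical, and it assumes field1 is enough to identify a record, which may not hold for spreadsheet data without a key.

```sql
-- Sketch only: update rows that match on field1, insert the rest.
-- dbo.DbRecords and dbo.SheetRecords are hypothetical names.
MERGE INTO dbo.DbRecords AS tgt
USING dbo.SheetRecords AS src
    ON tgt.field1 = src.field1
WHEN MATCHED AND tgt.field2 <> src.field2 THEN
    UPDATE SET tgt.field2 = src.field2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (field1, field2)
    VALUES (src.field1, src.field2);
```

MERGE needs an exact match condition in its ON clause, so it does not find "closest" matches by itself. A WHEN NOT MATCHED BY SOURCE THEN DELETE clause can be added for rows missing from the spreadsheet, but with foreign keys and child data involved that deserves care.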

How to efficiently join records with separate string table

I have a large table with a lot of duplicate string data. To save space, I have moved the string data to a separate table. My tables now look something like this:
MyRecords
RecordId (int) | FieldA (int) | FieldB (datetime) | FieldC (...) | MyString1Id (int) | MyString2Id (int) | MyString3Id (int) | ...
MyStrings
StringId (int) | StringValue (varchar)
The MyRecords table has about 10 foreign keys to the string table. I have a stored procedure GetMyRecords that retrieves a list of records with the actual string values. This sp now has 10 joins to the string table for each string relation:
SELECT [Field1], [Field2], [Field3], ..., [Strings1].[StringValue], [Strings2].[StringValue], ...
FROM MyRecords INNER JOIN
MyStrings AS Strings1 ON MyRecords.MyString1Id = Strings1.StringId INNER JOIN
MyStrings AS Strings2 ON MyRecords.MyString2Id = Strings2.StringId INNER JOIN
MyStrings AS Strings3 ON MyRecords.MyString3Id = Strings3.StringId INNER JOIN
(more joins)
WHERE [Field1] = @Field1 AND [Field2] = @Field2
GetMyRecords is considerably slower than I would want because of all the joins. How could I improve performance for this sp? Can I somehow turn this into a single join?
The strings table has a clustered primary key on StringId, and all the where fields are in a nonclustered index on the MyRecords table.
You should probably take one further step toward normalization and create a join table. Instead of having the MyStringNId columns in MyRecords, have a third table:
CREATE TABLE RecordsStrings (
RecordId [theDataType] NOT NULL REFERENCES MyRecords (RecordId),
StringId [theDataType] NOT NULL REFERENCES MyStrings (StringId)
)
It is not convenient then to have all the strings in the same row of the returned data from the SELECT (though maybe there's a way to do this with a pivot somehow), so it's probably better to restructure the calling code to deal with results returned from:
SELECT [StringValue]
FROM [MyStrings] s
INNER JOIN [RecordsStrings] rs ON rs.StringId = s.StringId
INNER JOIN [MyRecords] r ON rs.RecordId = r.RecordId
WHERE r.Field1 = @Field1 AND r.Field2 = @Field2
If you need the other fields from MyRecords, you can select those as well, though they would appear in every relevant row. If you have multiple matches on Field1 and Field2, though, that may be helpful.
Can I somehow turn this into a single join?
If it is common for the same combination of strings to occur on multiple rows of MyRecords then it would make sense to store those combinations in a separate table. Then you could do a single join.
So long as you are only storing individual strings, then it is not possible to do this in a single join, since it has to search for each string separately.
You can make the queries easier to read and write by creating a view of the table that includes all of the joins. This will not improve performance, but it will make your queries look a lot better.
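A view along those lines might look like the sketch below. The view name and the String1 to String3 aliases are made up for illustration, and only three of the roughly ten string columns are shown.

```sql
-- Sketch only: hide the repeated joins behind a view.
-- The view name and the String1..String3 aliases are hypothetical.
CREATE VIEW dbo.MyRecordsWithStrings
AS
SELECT r.RecordId,
       r.FieldA,
       r.FieldB,
       s1.StringValue AS String1,
       s2.StringValue AS String2,
       s3.StringValue AS String3
FROM dbo.MyRecords AS r
INNER JOIN dbo.MyStrings AS s1 ON r.MyString1Id = s1.StringId
INNER JOIN dbo.MyStrings AS s2 ON r.MyString2Id = s2.StringId
INNER JOIN dbo.MyStrings AS s3 ON r.MyString3Id = s3.StringId;
```

The stored procedure can then select from dbo.MyRecordsWithStrings with its usual WHERE clause. The joins still run underneath, so this changes readability rather than speed.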
How could I improve performance for this sp?
There are things you can do, depending on the form of the data.
If the strings in one field contains (mostly) different information than another field, then you could try putting them into different tables. There is a chance this could improve performance if the maximum length of one field is much smaller than the other or if the number of different values for one field is much smaller than the other.
First step would be to run a performance analysis to see where the problems are.
Just on a lark though, you may pick up a bit of a performance gain by using the (nolock) hint on the joined tables, at the cost of allowing dirty reads of uncommitted data.

How does sql server choose values in an update statement where there are multiple options?

I have an update statement in SQL server where there are four possible values that can be assigned based on the join. It appears that SQL has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The match column contains only 1's, the Data column increments by 1 for every row)
Match  Data
-----  ----
1      1
1      2
1      3
1      4
That table will update another table called Destination with the same two columns structured as below:
Match  Data
-----  ----
1      NULL
If you want to update the Data field in Destination in the following way:
UPDATE
Destination
SET
Data = Source.Data
FROM
Destination
INNER JOIN
Source
ON
Destination.Match = Source.Match
there will be four possible values that Destination.Data could end up with after this query is run. I've found that changing the indexes on Source affects which value Destination is set to, and it appears that SQL Server just updates the Destination table with the first matching value it finds.
Is that accurate? Is it possible that SQL Server is updating the Destination with every possible value sequentially and I end up with the same kind of result as if it were updating with the first value it finds? It seems to be possibly problematic that it will seemingly randomly choose one row to update, as opposed to throwing an error when presented with this situation.
Thank you.
P.S. I apologize for the poor formatting. Hopefully, the intent is clear.
It assigns every one of the results to Data. Which one you end up with after the query depends on the order of the results returned (which one it sets last).
Since there's no ORDER BY clause, you're left with whatever order SQL Server comes up with. That will normally follow the physical order of the records on disk, and that in turn typically follows the clustered index for a table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, SQL Server will return the results in whatever order it thinks it can do fastest.
You can play with this by turning your update query into a select query, so you can see the results. Notice which record comes first and which record comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal, and you can use it to effectively JOIN to another table and guarantee the join only matches one record.
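As an illustration of the APPLY idea, the sketch below picks exactly one Source row per Destination row. Choosing the highest Data value is an arbitrary assumption made for the example; any ORDER BY that identifies a single row would do.

```sql
-- Sketch only: use CROSS APPLY with TOP (1) so that each Destination
-- row is paired with exactly one Source row before the update runs.
UPDATE d
SET    d.Data = s.Data
FROM   Destination AS d
CROSS APPLY (
    SELECT TOP (1) src.Data
    FROM Source AS src
    WHERE src.Match = d.Match
    ORDER BY src.Data DESC   -- arbitrary choice for this example
) AS s;
```

Because the ORDER BY lives inside the applied subquery rather than on the UPDATE itself, it is allowed, and the result no longer depends on indexes or plan choice as long as the ordering is unambiguous.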
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE @Source TABLE(Match INT, Data INT);
INSERT INTO @Source
VALUES
    (1, 1),
    (1, 2),
    (1, 3),
    (1, 4);
DECLARE @Destination TABLE(Match INT, Data INT);
INSERT INTO @Destination
VALUES
    (1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN @Source Source
    ON Destination.Match = Source.Match;
SELECT *
FROM @Destination;
And look at the actual execution plan. I see the following.
The output columns from @Destination are Bmk1000, Match. Bmk1000 is an internal row identifier (used here due to lack of clustered index in this example) and would be different for each row emitted from @Destination (if there was more than one).
The single row is then joined onto the four matching rows in @Source and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(@Source.[Data]).
The ANY aggregate is an internal aggregate function not available in TSQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
    SELECT Match,
           MAX(Data) AS Data
    FROM @Source
    GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN GroupedSource Source
    ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
    SELECT Match,
           Data,
           ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
    FROM @Source
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN RankedSource Source
    ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful: if you need to set multiple columns, it ensures that all the values used come from the same source row. For the result to be deterministic, the combination of PARTITION BY and ORDER BY columns should be unique.

Resources