comparing data via a function - sql-server

I have two sets of data (locations) in separate tables and I need to compare whether they match or not. I have a UDF which performs a calculation based upon 5 values from each table.
How do I perform a SELECT with a join using this UDF?
My UDF is basically defined as:
ALTER FUNCTION [dbo].[MatchRanking]
(
    @Latitude FLOAT
    , @Longitude FLOAT
    , @Postcode VARCHAR(16)
    , @CompanyName VARCHAR(256)
    , @TelephoneNumber VARCHAR(32)
    , @Latitude2 FLOAT
    , @Longitude2 FLOAT
    , @Postcode2 VARCHAR(16)
    , @CompanyName2 VARCHAR(256)
    , @TelephoneNumber2 VARCHAR(32)
)
RETURNS INT
WITH EXECUTE AS CALLER
AS
BEGIN
    DECLARE @RetVal INT
    DECLARE @PostcodeVal INT
    SET @RetVal = 0
    SET @PostcodeVal = 0
    SET @RetVal = @RetVal + dbo.FuzzyLogicStringMatch(@CompanyName, @CompanyName2)
    IF @RetVal = 1 AND dbo.TelephoneNoStringMatch(@TelephoneNumber, @TelephoneNumber2) = 1
        RETURN 5
    ELSE IF @RetVal = 1 AND dbo.FuzzyLogicStringMatch(@Postcode, @Postcode2) = 1
        RETURN 5
    ELSE IF @RetVal = 1 AND ROUND(@Latitude,4) = ROUND(@Latitude2,4) AND ROUND(@Longitude,4) = ROUND(@Longitude2,4)
        RETURN 5
    ELSE IF (@RetVal = 1 AND ROUND(@Latitude,4) = ROUND(@Latitude2,4)) OR (@RetVal = 1 AND ROUND(@Longitude,4) = ROUND(@Longitude2,4))
        SET @RetVal = 2
    ELSE
    BEGIN
        IF ROUND(@Latitude,4) = ROUND(@Latitude2,4)
            SET @RetVal = @RetVal + 1
        IF ROUND(@Longitude,4) = ROUND(@Longitude2,4)
            SET @RetVal = @RetVal + 1
        SET @RetVal = @RetVal + dbo.TelephoneNoStringMatch(@TelephoneNumber, @TelephoneNumber2)
        SET @RetVal = @RetVal + dbo.FuzzyLogicStringMatch(@Postcode, @Postcode2)
    END
    RETURN @RetVal
END
This is the previous code that I am trying to fix:
SELECT li.LImportId, l.LocationId,
    dbo.MatchRanking(li.Latitude, li.Longitude, li.[Name], li.Postcode, li.TelephoneNumber,
                     l.Latitude, l.Longitude, l.CompanyName, l.Postcode, l.TelephoneNumber) AS [MatchRanking]
FROM #LocImport li
LEFT JOIN [Location] l
    ON li.[Latitude] = l.[Latitude]
    OR li.[Longitude] = l.[Longitude]
    OR li.[Postcode] = l.[Postcode]
    OR li.[Name] = l.[CompanyName]
    OR li.[TelephoneNumber] = l.[TelephoneNumber]

What was wrong with your original JOIN? That should perform much faster than this function.
This should do it, but I think it will be really slow:
SELECT
...
FROM Table1 t1
CROSS JOIN Table2 t2
WHERE dbo.MatchRanking(t1.Latitude ,..,..,t2.Latitude ,..)=1 --"1" or whatever return value is a match

One thing we did was set up a separate table to store the results of the cross join so they only have to be calculated once. Then have a job that runs nightly to pick up any new records and populate them against all the old records. After all, the lat/longs are not going to change (except for the occasional typo, which the nightly job can find and fix), and it doesn't make sense to do this calculation every time you run a query to find the distances when the calculation will always be the same for most of the numbers. Once this data is in a table, you can query it very quickly. Populating the table the first time might take a while.
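A rough sketch of that approach (the permanent staging table dbo.LocationImport and the result table are illustrative names, not from the original post; the UDF arguments follow the declared parameter order):
-- Sketch only: persist match rankings once, then top them up with new rows each night.
CREATE TABLE dbo.LocationMatchRanking
(
    LImportId    INT NOT NULL,
    LocationId   INT NOT NULL,
    MatchRanking INT NOT NULL,
    CONSTRAINT PK_LocationMatchRanking PRIMARY KEY (LImportId, LocationId)
);

-- Nightly job: rank only import rows that have not been scored yet.
INSERT INTO dbo.LocationMatchRanking (LImportId, LocationId, MatchRanking)
SELECT li.LImportId,
       l.LocationId,
       dbo.MatchRanking(li.Latitude, li.Longitude, li.Postcode, li.[Name], li.TelephoneNumber,
                        l.Latitude,  l.Longitude,  l.Postcode,  l.CompanyName, l.TelephoneNumber)
FROM dbo.LocationImport li
CROSS JOIN dbo.[Location] l
WHERE NOT EXISTS (SELECT 1 FROM dbo.LocationMatchRanking r WHERE r.LImportId = li.LImportId);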

Personally, I would do this kind of fuzzy data matching in several passes (a rough sketch follows the list):
-create an xref table which contains the keys for the records that match
-get all the records that match exactly and insert their keys into the xref table
-"fuzzify" your criteria and search again, but only in those records that do not already have a match in the xref
-rinse and repeat, expanding and/or fuzzifying your criteria until the matches you get are garbage
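A rough sketch of the xref approach, assuming an import table dbo.LocationImport keyed by LImportId (table and column names are illustrative):
-- Sketch only: record which pass produced each match so later passes can skip matched rows.
CREATE TABLE dbo.LocationXref
(
    LImportId  INT NOT NULL,
    LocationId INT NOT NULL,
    MatchPass  TINYINT NOT NULL,
    CONSTRAINT PK_LocationXref PRIMARY KEY (LImportId, LocationId)
);

-- Pass 1: exact matches only.
INSERT INTO dbo.LocationXref (LImportId, LocationId, MatchPass)
SELECT li.LImportId, l.LocationId, 1
FROM dbo.LocationImport li
JOIN dbo.[Location] l
    ON  li.Postcode = l.Postcode
    AND li.[Name] = l.CompanyName
    AND li.TelephoneNumber = l.TelephoneNumber;

-- Pass 2: fuzzier criteria, restricted to rows that still have no match.
INSERT INTO dbo.LocationXref (LImportId, LocationId, MatchPass)
SELECT li.LImportId, l.LocationId, 2
FROM dbo.LocationImport li
JOIN dbo.[Location] l
    ON  li.Postcode = l.Postcode
    AND dbo.FuzzyLogicStringMatch(li.[Name], l.CompanyName) = 1
WHERE NOT EXISTS (SELECT 1 FROM dbo.LocationXref x WHERE x.LImportId = li.LImportId);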

You will have to cross join the two tables (m and n rows yield m x n comparisons) and compare to find matches - there is really no other simple way.
However, with some meta-understanding of the data, its interpretation and your goals, if you can somehow filter the set to eliminate items from the cross join fairly cheaply, that will help - especially if there is anything that has to be an exact match, or any way to partition the data so that items in different partitions are never compared (e.g. comparing a US and a European location would always have match rank 0).
I would say that you could use the Lat and Long to eliminate pairs completely if they differ by a certain amount, but it looks like they are used to improve the match ranking, not to negatively eliminate items from being match ranked.
And scalar functions (your fuzzy matches) called repeatedly (as in a multi-million row cross join) are tremendously expensive.
It looks to me like you could extract the first match and the inner else to take place in your cross join (and inline if possible, not as a UDF) so that they can be somewhat optimized by the query optimizer in conjunction with the cross join, instead of inside a black box called m x n times.
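For example, the exact coordinate comparisons can be expressed inline rather than inside the UDF (a sketch only, reusing the poster's table and column names; the WHERE filter is purely illustrative):
-- Sketch only: cheap, optimizer-visible comparisons inline; the expensive fuzzy UDFs
-- would then only be called for pairs that survive this filter.
SELECT li.LImportId,
       l.LocationId,
       CASE WHEN ROUND(li.Latitude, 4)  = ROUND(l.Latitude, 4)  THEN 1 ELSE 0 END
     + CASE WHEN ROUND(li.Longitude, 4) = ROUND(l.Longitude, 4) THEN 1 ELSE 0 END AS CoordScore
FROM #LocImport li
CROSS JOIN [Location] l
WHERE li.Postcode = l.Postcode            -- or whatever exact predicate partitions the data
   OR (ROUND(li.Latitude, 4) = ROUND(l.Latitude, 4) AND ROUND(li.Longitude, 4) = ROUND(l.Longitude, 4));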
Another possibility is to pre-extract only the distinct Phone number pairs, post code pairs etc.
SELECT Postcode1, Postcode2, dbo.FuzzyLogicStringMatch(Postcode1, Postcode2) AS MatchRank
FROM (
SELECT DISTINCT Postcode AS Postcode1
FROM Table1
) AS Postcodes1
CROSS JOIN
(
SELECT DISTINCT Postcode AS Postcode2
FROM Table2
) AS Postcodes2
If your function is symmetric, you can further reduce this space over which to call the UDF with some extra work (it's easier if the table is a self-join, you just use an upper right triangle).
Now you have the minimal set of compares for your scalar UDF to be called over. Put this result into a table indexed on the two columns.
You can do the same for all your UDF parameter sets. If the function definition isn't changing, you only need to add new combinations to your table over time, turning an expensive scalar function call into a table lookup on relatively slowly growing data. You can even use a CASE expression to fall back on the inline UDF call if the lookup table doesn't have an entry, so you can decide whether or not to keep the lookup tables comprehensive - as sketched below.
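A sketch of the lookup-with-fallback idea, assuming the pre-computed pairs from the query above were stored in a table dbo.PostcodeMatch(Postcode1, Postcode2, MatchRank) indexed on the two postcode columns (names are assumptions, not from the post):
-- Sketch only: use the persisted score when present, otherwise fall back to the scalar UDF.
SELECT t1.Postcode AS Postcode1,
       t2.Postcode AS Postcode2,
       CASE WHEN pm.MatchRank IS NOT NULL
            THEN pm.MatchRank
            ELSE dbo.FuzzyLogicStringMatch(t1.Postcode, t2.Postcode)
       END AS PostcodeMatchRank
FROM Table1 t1
CROSS JOIN Table2 t2
LEFT JOIN dbo.PostcodeMatch pm
    ON  pm.Postcode1 = t1.Postcode
    AND pm.Postcode2 = t2.Postcode;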

Related

IIF with Multiple Case Statements for Computed Column

I'm trying to add this as a Formula (Computed Column) but I'm getting an error message saying it is not valid.
Can anyone see what is wrong with the below formula?
IIF
(
select * from Config where Property = 'AutomaticExpiry' and Value = 1,
case when [ExpiryDate] IS NULL OR sysdatetimeoffset()<[ExpiryDate] then 1 else 0 end,
case when [ExpiryDate] IS NULL then 1 else 0 end
)
From BOL, ALTER TABLE, computed_column_definition:
computed_column_expression is an expression that defines the value of a computed column. A computed column is a virtual column that is not physically stored in the table but is computed from an expression that uses other columns in the same table. For example, a computed column could have the definition: cost AS price * qty. The expression can be a noncomputed column name, constant, function, variable, and any combination of these connected by one or more operators. The expression cannot be a subquery or include an alias data type.
Wrap the logic in a function. Something like this:
CREATE FUNCTION [dbo].[fn_CustomFunction]
(
    @ExpireDate DATETIME2
)
RETURNS BIT
AS
BEGIN;
    DECLARE @Value BIT = 0;
    IF EXISTS (select * from Config where Property = 'AutomaticExpiry' and Value = 1)
    BEGIN;
        SET @Value = IIF(sysdatetimeoffset() < @ExpireDate, 1, 0)
        RETURN @Value;
    END;
    RETURN IIF(@ExpireDate IS NULL, 1, 0);
END;
GO
--DROP TABLE IF EXISTS dbo.TEST;
CREATE TABLE dbo.TEST
(
[ID] INT IDENTITY(1,1)
,[ExpireDate] DATETIME2
,ComputeColumn AS [dbo].[fn_CustomFunction] ([ExpireDate])
)
GO
INSERT INTO dbo.TEST (ExpireDate)
VALUES ('2019-01-01')
,('2018-01-01')
,(NULL);
SELECT *
FROM dbo.Test;
You're trying to do something, though we're not quite sure what - a classic XY problem. You have some task, like "implement automatic login expiry if it's switched on in the prefs table", you've devised a broken solution (a computed column using IIF), and you've asked for help with why it's broken - which doesn't solve the actual core problem.
In transitioning from your current state to one where you're solving the problem, you can consider:
As a view:
CREATE VIEW yourtable_withexpiry AS
SELECT
    yourtable.*,
    CASE WHEN [ExpiryDate] IS NULL OR (config.[Value] = 1 AND SysDateTimeOffset() < [ExpiryDate]) THEN 1 ELSE 0 END AS IsValid
FROM
    yourtable
LEFT JOIN
    config
    ON config.property = 'AutomaticExpiry'
As a trigger:
CREATE TRIGGER trg_withexpiry ON yourtable
AFTER INSERT, UPDATE
AS
BEGIN
    IF NOT EXISTS (select * from Config where Property = 'AutomaticExpiry' and Value = 1)
        RETURN;
    UPDATE y SET [ExpiryDate] = DATEADD(..some current time and suitable offset here..)
    FROM yourtable y INNER JOIN inserted i ON y.pk = i.pk;
END;
But honestly, you should be doing this in your front-end app. It should be responsible for reading/writing session data, keeping things up to date and kicking users out if they're over time, etc. Using the database for this is, to a large extent, putting business logic/decision processing into a system that shouldn't be concerned with it.
Have your front-end code look up the user info upon some regular event (like page navigation or other activity) and refresh the expiry date as a consequence of that activity, but only if the expiry date hasn't already passed. Do keep treating a NULL expiry as valid if you want a way to have people stay active forever (or whatever).

How many times is function called

Let's say we have the following query:
SELECT * FROM {tablename} WHERE ColumnId = dbo.GetId()
where dbo.GetId() is a non-deterministic user-defined function. The question is whether dbo.GetId() is called only once for the entire query and its result then applied, or whether it is called for each row. I think it is called for every row, but I don't know of any way to prove it.
Also would following query be more efficient?
DECLARE @Id int
SET @Id = dbo.GetId()
SELECT * FROM {tablename} WHERE ColumnId = @Id
I doubt this is guaranteed anywhere. Use a variable if you want to ensure it.
I amended @Prdp's example:
CREATE VIEW vw_rand
AS
SELECT Rand() ran
GO
/*Return 0 or 1 with 50% probability*/
CREATE FUNCTION dbo.Udf_non_deterministic ()
RETURNS INT
AS
BEGIN
RETURN
(SELECT CAST(10000 * ran AS INT) % 2
FROM vw_rand)
END
go
SELECT *
FROM master..spt_values
WHERE dbo.Udf_non_deterministic() = 1
In this case it is only evaluated once. Either all rows are returned or zero.
The reason for this is that the plan has a filter with a startup predicate.
The startup expression predicate is [tempdb].[dbo].[Udf_non_deterministic]()=(1).
This is only evaluated once when the filter is opened to see whether to get rows from the subtree at all - not for each row passing through it.
But conversely the below returns a different number of rows each time indicating that it is evaluated per row. The comparison to the column prevents it being evaluated up front in the filter as with the previous example.
SELECT *
FROM master..spt_values
WHERE dbo.Udf_non_deterministic() = (number - number)
And this rewrite goes back to evaluating once (for me) but CROSS APPLY still gave multiple evaluations.
SELECT *
FROM master..spt_values
OUTER APPLY(SELECT dbo.Udf_non_deterministic() ) AS C(X)
WHERE X = (number - number)
Here is one way to prove it.
View
A view is created so that a nondeterministic built-in function (RAND) can be used inside the user-defined function:
CREATE VIEW vw_rand
AS
SELECT Rand() ran
Nondeterministic function
Now create a nondeterministic user-defined function using the above view:
CREATE FUNCTION Udf_non_deterministic ()
RETURNS FLOAT
AS
BEGIN
RETURN
(SELECT ran
FROM vw_rand)
END
Sample table
CREATE TABLE #test
(
id INT,
name VARCHAR(50)
)
INSERT #test
VALUES (1,'a'),
(2,'b'),
(3,'c'),
(4,'d')
SELECT dbo.Udf_non_deterministic (), *
FROM #test
Result:
id name non_deterministic_val
1 a 0.203123494465542
2 b 0.888439497446073
3 c 0.633749721616085
4 d 0.104620204364744
As you can see, the function is called for every row.
Yes, it does get called once per row.
See the following thread for debugging functions:
SQL Functions - Logging
And yes, the query below is more efficient, as the function is called only once:
DECLARE @Id int
SET @Id = dbo.GetId()
SELECT * FROM {tablename} WHERE ColumnId = @Id

Optimizing sql server scalar-valued function

Here is my question: I have a view calling another view, and that second view uses a scalar function, which obviously runs for each row of the table. For only 322 rows it takes around 30 seconds; when I take out the calculated field, it takes 1 second.
I would appreciate it if you could give me an idea of whether I can optimize the function, or whether there is any other way to increase the performance.
Here is the function:
ALTER FUNCTION [dbo].[fnCabinetLoad] (
    @site nvarchar(15),
    @cabrow nvarchar(50),
    @cabinet nvarchar(50))
RETURNS float
AS BEGIN
    -- Declare the return variable here
    DECLARE @ResultVar float
    -- Add the T-SQL statements to compute the return value here
    SELECT @ResultVar = SUM(d.Value)
    FROM
    (
        SELECT dt.*,
            ROW_NUMBER()
                OVER (PARTITION BY dt.tagname ORDER BY dt.timestamp DESC) 'RowNum'
        FROM vDataLog dt
        WHERE dt.Timestamp BETWEEN dateadd(minute,-15,getdate()) AND GetDate()
    ) d
    INNER JOIN [SKY_EGX_CONFIG].[dbo].[vPanelSchedule] AS p
        ON p.rpp = left(d.TagName,3) + substring(d.TagName,5,5)
                 + substring(d.TagName,11,8)
        AND right(p.pole,2) = substring(d.TagName,23,2)
        AND p.site = @site
        AND p.EqpRowNumber = @cabrow
        AND p.EqpCabinetName = @cabinet
    WHERE d.RowNum = 1
        AND Right(d.TagName, 6) = 'kW Avg'
    RETURN @ResultVar
END
Scalar-valued functions have atrocious performance. Your function looks like an excellent candidate for an inline table-valued function that you can CROSS APPLY:
CREATE FUNCTION [dbo].[fnCabinetLoad]
(
    @site nvarchar(15),
    @cabrow nvarchar(50),
    @cabinet nvarchar(50)
)
RETURNS TABLE
AS RETURN
    SELECT SUM(d.Value) AS [TotalLoad]
    FROM
    (
        SELECT dt.*, ROW_NUMBER() OVER (PARTITION BY dt.tagname ORDER BY dt.timestamp DESC) 'RowNum'
        FROM vDataLog dt
        WHERE dt.Timestamp BETWEEN dateadd(minute,-15,getdate()) AND GetDate()
    ) d
    INNER JOIN [SKY_EGX_CONFIG].[dbo].[vPanelSchedule] AS p
        ON p.rpp = left(d.TagName,3) + substring(d.TagName,5,5) + substring(d.TagName,11,8)
        AND right(p.pole,2) = substring(d.TagName,23,2)
        AND p.site = @site
        AND p.EqpRowNumber = @cabrow
        AND p.EqpCabinetName = @cabinet
    WHERE d.RowNum = 1
        AND Right(d.TagName, 6) = 'kW Avg'
In your view:
SELECT ..., cabinetLoad.TotalLoad
FROM ... CROSS APPLY dbo.fnCabinetLoad(.., .., ..) AS cabinetLoad
My understanding is the returned result set is 322 rows, but if the vDataLog table is significantly larger, I would run that subquery first and dump that result set into a table variable. Then, you can use that table variable instead of a nested query.
Otherwise, as it stands now, I think the joins are being done on all rows of the nested query and then you're stripping them off with the where clause to get the rows you want.
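Something along these lines (a sketch only; the column names and types for the table variable are assumed):
-- Sketch only: materialise the 15-minute slice of vDataLog once, then join it per cabinet.
DECLARE @recent TABLE
(
    TagName     NVARCHAR(100),
    [Value]     FLOAT,
    [Timestamp] DATETIME2,
    RowNum      INT
);

INSERT INTO @recent (TagName, [Value], [Timestamp], RowNum)
SELECT dt.TagName, dt.[Value], dt.[Timestamp],
       ROW_NUMBER() OVER (PARTITION BY dt.TagName ORDER BY dt.[Timestamp] DESC)
FROM vDataLog dt
WHERE dt.[Timestamp] BETWEEN DATEADD(MINUTE, -15, GETDATE()) AND GETDATE();

-- the per-cabinet SUM then joins @recent (WHERE RowNum = 1) instead of re-reading vDataLog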
You really don't need a function, and you should get rid of the nested view (very poor for performance). Encapsulate the entire logic in a stored procedure to get the desired result, so that instead of computing everything row by row, it's computed as a set. Instead of the view, use the source table to do the computation inside the stored procedure.
Apart from that, you are using the functions RIGHT, LEFT and SUBSTRING inside your code. Never have them in a WHERE or JOIN clause. Try to compute them beforehand and dump the results into a temp table so that they are computed once, then index the temp table on those columns (rough sketch below).
Sorry for the theoretical answer, but right now the code is a mess. It needs to go through layers of changes to have decent performance.
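A sketch of that idea (the temp table and index names are illustrative):
-- Sketch only: compute the string-built join keys once, index them, then join on the columns.
SELECT dt.*,
       LEFT(dt.TagName, 3) + SUBSTRING(dt.TagName, 5, 5) + SUBSTRING(dt.TagName, 11, 8) AS RppKey,
       SUBSTRING(dt.TagName, 23, 2) AS PoleKey
INTO #DataLogKeys
FROM vDataLog dt
WHERE dt.[Timestamp] BETWEEN DATEADD(MINUTE, -15, GETDATE()) AND GETDATE();

CREATE INDEX IX_DataLogKeys ON #DataLogKeys (RppKey, PoleKey);

-- the join to vPanelSchedule can now compare p.rpp = k.RppKey and RIGHT(p.pole, 2) = k.PoleKey
-- without evaluating LEFT/SUBSTRING for every comparison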
Turn the function into a view.
Use it by restricting on the columns site, cabrow, cabinet and Timestamp. When doing that, try storing GETDATE() and DATEADD(minute, -15, GETDATE()) in variables; not doing so can prevent you from taking advantage of any index on Timestamp.
SELECT SUM(d.Value) AS [TotalLoad],
    d.[Timestamp],
    p.site,
    p.EqpRowNumber AS cabrow,
    p.EqpCabinetName AS cabinet
FROM
(
    SELECT dt.*,
        ROW_NUMBER() OVER (PARTITION BY dt.tagname ORDER BY dt.timestamp DESC) 'RowNum'
    FROM vDataLog dt
) d
INNER JOIN [SKY_EGX_CONFIG].[dbo].[vPanelSchedule] AS p
    ON p.rpp = left(d.TagName,3) + substring(d.TagName,5,5) + substring(d.TagName,11,8)
    AND right(p.pole,2) = substring(d.TagName,23,2)
WHERE d.RowNum = 1
    AND d.TagName LIKE '%kW Avg'
GROUP BY d.[Timestamp], p.site, p.EqpRowNumber, p.EqpCabinetName
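Usage would then look something like this (a sketch; the view name and the parameter values are placeholders):
-- Sketch only: fix the time window in variables so an index on Timestamp can be used.
DECLARE @now DATETIME2 = SYSDATETIME();
DECLARE @start DATETIME2 = DATEADD(MINUTE, -15, @now);
DECLARE @site NVARCHAR(15) = N'SITE1', @cabrow NVARCHAR(50) = N'ROW1', @cabinet NVARCHAR(50) = N'CAB1';  -- example values

SELECT SUM(v.TotalLoad) AS TotalLoad
FROM dbo.vCabinetLoad v                 -- hypothetical name for the view sketched above
WHERE v.[Timestamp] BETWEEN @start AND @now
  AND v.site    = @site
  AND v.cabrow  = @cabrow
  AND v.cabinet = @cabinet;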

Comparing with null by converting value to a constant with isnull

One of our programmers tends to use isnull in MS SQL to compare with NULLs.
That is instead of writing Where ColumnName Is Null he writes Where IsNull(ColumnName,0)=0
I think that the optimizer will convert the latter into the former anyway, but if it does not - is there a way to prove that the latter is less effective, since it:
1. Compares with NULL,
2. Converts to integer,
3. Compares 2 integers
instead of just comparing with NULL?
Both ways run too fast on my data for me to compare them via execution plans (and I also think the optimizer plays its part). Is there a way to prove to him that just comparing with NULL, without ISNULL, is more effective (unless it isn't)?
Another obvious issue is that ISNULL precludes the use of indexes.
Run this setup:
create table T1 (
ID int not null primary key,
Column1 int null
)
go
create index IX_T1 on T1(Column1)
go
declare @i int
set @i = 10000
while @i > 0
begin
    insert into T1 (ID,Column1) values (@i, CASE WHEN @i%1000=0 THEN NULL ELSE @i%1000 END)
    set @i = @i - 1
end
go
Then turn on execution plans and run the following:
select * from T1 where Column1 is null
select * from T1 where ISNULL(Column1,0)=0
The first uses an index seek (using IX_T1) and is quite efficient. The second uses an index scan on the clustered index - it has to look at every row in the table.
On my machine, the second query took 90% of the time, the first 10%.
ISNULL is not well used if you are putting it in the WHERE clause and comparing the result to 0; the purpose of ISNULL is to replace the value NULL.
http://msdn.microsoft.com/en-us/library/ms184325.aspx
For example:
SELECT Description, DiscountPct, MinQty, ISNULL(MaxQty, 0.00) AS 'Max Quantity'
FROM Sales.SpecialOffer;

tsql bulk update

MyTableA has several million records. On regular occasions every row in MyTableA needs to be updated with values from TheirTableA.
Unfortunately I have no control over TheirTableA and there is no field to indicate if anything in TheirTableA has changed so I either just update everything or I update based on comparing every field which could be different (not really feasible as this is a long and wide table).
Unfortunately the transaction log is ballooning doing a straight update so I wanted to chunk it by using UPDATE TOP, however, as I understand it I need some field to determine if the records in MyTableA have been updated yet or not otherwise I'll end up in an infinite loop:
declare @again as bit;
set @again = 1;
while @again = 1
begin
    update top (10000) my
    set my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
    from MyTableA my
    join TheirTableA their on my.Id = their.Id
    if @@ROWCOUNT > 0
        set @again = 1
    else
        set @again = 0
end
Is the only way this will work to add in a filter like the following?
where my.A1 <> their.A1 and my.A2 <> their.A2 and my.A3 <> their.A3
This seems like it will be horribly inefficient with many columns to compare.
I'm sure I'm missing an obvious alternative?
Assuming both tables have the same structure, you can get a result set of the rows that are different using:
SELECT * INTO #different_rows FROM MyTableA EXCEPT SELECT * FROM TheirTableA
and then update from that using whatever key fields are available, as sketched below.
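A sketch with the poster's table names (assuming Id is the key and the A1..A3 columns shown in the question):
-- Sketch only: rows in MyTableA that differ from TheirTableA, then a keyed update of just those.
SELECT *
INTO #different_rows
FROM MyTableA
EXCEPT
SELECT *
FROM TheirTableA;

UPDATE my
SET my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
FROM MyTableA my
JOIN #different_rows d ON my.Id = d.Id          -- only the rows that actually changed
JOIN TheirTableA their ON my.Id = their.Id;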
Well, the first, and simplest solution, would obviously be if you could change the schema to include a timestamp for last update - and then only update the rows with a timestamp newer than your last change.
But if that is not possible, another way to go could be to use the HashBytes function, perhaps by concatenating the fields into an xml that you then compare. The caveat here is an 8kb limit (https://connect.microsoft.com/SQLServer/feedback/details/273429/hashbytes-function-should-support-large-data-types) EDIT: Once again, I have stolen code, this time from:
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2009/10/21/detecting-changed-rows-in-a-trigger-using-hashbytes-and-without-eventdata-and-or-s.aspx
His example is:
select batch_id
from (
select distinct batch_id, hash_combined = hashbytes( 'sha1', combined )
from ( select batch_id,
combined =( select batch_id, batch_name, some_parm, some_parm2
from deleted c -- need old values
where c.batch_id = d.batch_id
for xml path( '' ) )
from deleted d
union all
select batch_id,
combined =( select batch_id, batch_name, some_parm, some_parm2
from some_base_table c -- need current values (could use inserted here)
where c.batch_id = d.batch_id
for xml path( '' ) )
from deleted d
) as r
) as c
group by batch_id
having count(*) > 1
A last resort (and my original suggestion) is to try BINARY_CHECKSUM. As noted in the comments, this does open up the risk of a rather high collision rate.
http://msdn.microsoft.com/en-us/library/ms173784.aspx
I have stolen the following example from lessthandot.com - link to the full SQL (and other cool functions) is below.
--Data Mismatch
SELECT 'Data Mismatch', t1.au_id
FROM( SELECT BINARY_CHECKSUM(*) AS CheckSum1 ,au_id FROM pubs..authors) t1
JOIN(SELECT BINARY_CHECKSUM(*) AS CheckSum2,au_id FROM tempdb..authors2) t2 ON t1.au_id =t2.au_id
WHERE CheckSum1 <> CheckSum2
Example taken from http://wiki.lessthandot.com/index.php/Ten_SQL_Server_Functions_That_You_Have_Ignored_Until_Now
I don't know if this is better than adding where my.A1 <> their.A1 and my.A2 <> their.A2 and my.A3 <> their.A3, but I would definitely give it a try (assuming SQL Server 2005+):
declare @again as bit;
set @again = 1;
declare @idlist table (Id int);
while @again = 1
begin
    update top (10000) my
    set my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
    output inserted.Id into @idlist (Id)
    from MyTableA my
    join TheirTableA their on my.Id = their.Id
    left join @idlist i on my.Id = i.Id
    where i.Id is null
    /* alternatively (instead of left join + where):
    where not exists (select * from @idlist where Id = my.Id) */
    if @@ROWCOUNT > 0
        set @again = 1
    else
        set @again = 0
end
That is, declare a table variable for collecting the IDs of the rows being updated and use that table for looking up (and omitting) IDs that have already been updated.
A slight variation on the method would be to use a local temporary table instead of a table variable. That way you would be able to create an index on the ID lookup table, which might result in better performance.
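That variation might look like this (a sketch; #updated is an illustrative name):
-- Sketch only: same loop, but the Id lookup table is a temp table whose primary key gives it an index.
CREATE TABLE #updated (Id INT PRIMARY KEY);

DECLARE @again BIT;
SET @again = 1;
WHILE @again = 1
BEGIN
    UPDATE TOP (10000) my
    SET my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
    OUTPUT inserted.Id INTO #updated (Id)
    FROM MyTableA my
    JOIN TheirTableA their ON my.Id = their.Id
    WHERE NOT EXISTS (SELECT 1 FROM #updated u WHERE u.Id = my.Id);

    IF @@ROWCOUNT = 0
        SET @again = 0;
END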
If a schema change is not possible, how about using a trigger to save off the Ids that have changed, and only import/export those rows?
Or use a trigger to export them immediately.
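If a trigger on TheirTableA is allowed, a sketch might look like this (the tracking table and trigger names are illustrative):
-- Sketch only: record the Ids touched in TheirTableA so the bulk update can target only those rows.
CREATE TABLE dbo.TheirTableA_ChangedIds (Id INT PRIMARY KEY);
GO
CREATE TRIGGER trg_TheirTableA_TrackChanges ON TheirTableA
AFTER INSERT, UPDATE
AS
BEGIN
    INSERT INTO dbo.TheirTableA_ChangedIds (Id)
    SELECT i.Id
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TheirTableA_ChangedIds c WHERE c.Id = i.Id);
END;
GO
-- the periodic update then joins MyTableA to TheirTableA via dbo.TheirTableA_ChangedIds
-- and clears the tracking table afterwards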
