I'm attempting to convert the following SQL Server query into a GreenPlum version of the query:
INSERT INTO #TMP1 (part_id, file_id, location, measure_date)
SELECT DISTINCT
pt.part_id, qf.file_id, qf.edl_desc, pt.measure_date
FROM
part pt WITH (NOLOCK)
INNER JOIN
file_model qm with (nolock) on qm.file_model_id = pt.file_model_id
INNER JOIN
file qf with (nolock) on qf.file_id = qm.file_id;
INSERT INTO #part_list (file_id, part_id, measure_date)
SELECT DISTINCT
t1.file_id, k.part_id, k.measure_date
FROM
#TMP1 t1 WITH (NOLOCK)
CROSS APPLY
(SELECT DISTINCT TOP (300)
t2.part_id, t2.measure_date
FROM
#TMP1 t2 WITH (NOLOCK)
WHERE
t1.file_id = t2.file_id and t1.location = t2.location
ORDER BY
t2.measure_date DESC) k
WHERE
t1.measure_date >= dateadd(day, -30, getdate());
The idea is that the final table contains up to the 300 most recent parts for every part program that has been active (i.e. manufactured something) in the last 30 days.
Per the answers to this question, I am aware that LATERAL JOIN would do it, except my organization is using an older version of Postgres that does not have LATERAL, so I was left with implementing the following function instead:
CREATE FUNCTION BuildActiveParts(p_day INT, p_n INT)
RETURNS SETOF RECORD --TABLE (part_id bigint,file_id int, measure_date timestamp, location varchar(255))
AS $$
DECLARE
part_active RECORD;
part_list RECORD;
BEGIN
FOR part_active IN
SELECT DISTINCT qf.file_id, qf.location
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id WHERE pt.measure_date >= current_date - p_day LOOP
FOR part_list IN
SELECT DISTINCT pt.part_id, qf.file_id, pt.measure_date, qf.location
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id WHERE qf.file_id = part_active.file_id
AND qf.location = part_active.location
ORDER BY pt.measure_date DESC LIMIT p_n LOOP
RETURN NEXT part_list;
END LOOP;
END LOOP;
END
$$ LANGUAGE plpgsql;
-- Later used in:
--Build list of all active programs in last p_day days. This temporary table is a component of a larger function that produces a table, called daily, based on this and other calculations.
-- Note: this insert yields 'function cannot execute because it accesses relation'
INSERT INTO TMP_part_list ( part_id, file_id, measure_date, location)
SELECT DISTINCT * FROM BuildActiveParts(p_day, p_n) AS active_parts (part_id int, file_id text, measure_date timestamp, location text )
;
Unfortunately, this function is used in inserts to another table (an unavoidable reality of my business requirements), so while the function returns nice happy results when run in isolation, I get a big angry "function cannot execute on segment because it accesses relation" when I try to use it for its intended purpose. While I've seen suggestions to the effect of "make a VIEW instead", that's not really an option, because a view built from the script this functionality is part of would take too long to query.
What can I do, beyond embarking on a months-long excursion through a jungle of red tape to convince my organization to update their stuff, to resolve this?
Edit: Here are some attempts based on comments:
Attempt with function; it did not work because of "function cannot execute on segment because it accesses relation":
DROP FUNCTION IF EXISTS BuildRecentParts(TEXT, TEXT, INT);
CREATE FUNCTION BuildRecentParts(file_id TEXT, location_in TEXT, p_n INT)
RETURNS SETOF RECORD --TABLE (measure_date timestamp, part_id bigint)
AS $$
DECLARE
part_list RECORD;
BEGIN
FOR part_list IN
SELECT DISTINCT pt.measure_date, pt.part_id
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id
WHERE qf.file_id = file_id
AND qf.edl_desc = location_in
ORDER BY pt.measure_date DESC LIMIT p_n LOOP
RETURN NEXT part_list;
END LOOP;
END
$$ LANGUAGE plpgsql;
SELECT DISTINCT qf.file_id, qf.edl_desc, (SELECT pti.measure_date, pti.part_id FROM part pti
INNER JOIN file_model qmi on qmi.file_model_id = pti.file_model_id
INNER JOIN file qfi on qfi.file_id = qmi.file_id
WHERE qfi.file_id = qf.file_id
AND qfi.edl_desc = qf.edl_desc
ORDER BY pti.measure_date DESC LIMIT 300)
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id
WHERE pt.measure_date >= current_date - 30 ;
Attempt without function, will not work because subquery has multiple columns:
CREATE TEMPORARY TABLE TMP_TMP1 (part_id bigint, file_id varchar(255), location varchar(255), measure_date timestamp) DISTRIBUTED BY (part_id);
INSERT INTO TMP_TMP1 (part_id, file_id, location, measure_date)
SELECT DISTINCT pt.part_id, qf.file_id, qf.edl_desc, pt.measure_date
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id;
ANALYZE TMP_TMP1;
SELECT DISTINCT t1.file_id, t1.location, (SELECT t2.measure_date, t2.part_id FROM TMP_TMP1 t2
WHERE t2.file_id = t1.file_id
AND t2.location = t1.location
ORDER BY t2.measure_date DESC LIMIT 300)
FROM TMP_TMP1 t1
WHERE t1.measure_date >= current_date - 30;
I also attempted a recursive CTE, but found that that was unsupported.
Between the answers here and input from architects at my organization, we decided that we have struck a GreenPlum limitation that would be too costly to overcome. The logic that performs the CROSS APPLY will be shifted to the R script that calls the stored procedure that this functionality would have been a part of.
Well, Greenplum doesn't have dirty reads so you can't implement the nolock hint you have. That is probably a good thing too. I would recommend removing that from SQL Server too.
I think the best solution is to use an analytic (window) function here instead of that function or a correlated subquery; Greenplum supports window functions. This approach is also more efficient in SQL Server.
SELECT sub2.part_id, sub2.location, sub2.measure_date
FROM (
SELECT sub1.part_id, sub1.location, sub1.measure_date, row_number() over(partition by sub1.part_id order by sub1.measure_date desc) as rownum
FROM (
SELECT pt.part_id, qf.edl_desc as location, pt.measure_date
FROM part pt
INNER JOIN file_model qm on qm.file_model_id = pt.file_model_id
INNER JOIN file qf on qf.file_id = qm.file_id
WHERE pt.measure_date >= (now() - interval '30 days')
GROUP BY pt.part_id, qf.edl_desc, pt.measure_date
) AS sub1
) as sub2
WHERE sub2.rownum <= 300;
Now, I had to guess at your data because it looks like you could get into trouble with your original query if you have multiple qf.qcc_file_desc values because your original group by includes this. If you had multiple values, then things would get ugly.
I'm also not 100% sure on the row_number function without knowing your data. It might be this instead:
row_number() over(partition by sub1.part_id, sub1.location order by sub1.measure_date desc)
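To make the partitioning question concrete, here is a minimal sketch of the "latest N rows per group, for active groups only" pattern. It runs against SQLite through Python's sqlite3 module purely so that it is self-contained (window functions need SQLite 3.25 or later). The table `parts`, its columns, the sample dates and the helper name `latest_per_group` are made up for illustration, and I am assuming the group is (file_id, location) as in the original CROSS APPLY. The row_number() syntax is the same in Greenplum, but the date arithmetic would differ.

```python
import sqlite3

def latest_per_group(conn, n, since):
    """Return up to n most recent parts per (file_id, location),
    restricted to groups with at least one part on or after `since`."""
    sql = """
        SELECT file_id, location, part_id, measure_date
        FROM (
            SELECT file_id, location, part_id, measure_date,
                   -- rank rows inside each group, newest first
                   row_number() OVER (PARTITION BY file_id, location
                                      ORDER BY measure_date DESC, part_id DESC) AS rownum,
                   -- most recent activity of the whole group
                   max(measure_date) OVER (PARTITION BY file_id, location) AS last_seen
            FROM parts
        ) AS ranked
        WHERE rownum <= ? AND last_seen >= ?
        ORDER BY file_id, location, measure_date DESC
    """
    return conn.execute(sql, (n, since)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_id INTEGER, file_id TEXT, location TEXT, measure_date TEXT)")
conn.executemany("INSERT INTO parts VALUES (?, ?, ?, ?)", [
    (1, "A", "L1", "2024-01-01"),
    (2, "A", "L1", "2024-03-01"),
    (3, "A", "L1", "2024-03-10"),
    (4, "B", "L1", "2023-06-01"),   # group B has no recent activity
])
rows = latest_per_group(conn, 2, "2024-02-15")
```

Note that the activity filter is applied to the group (`last_seen`), not to each row, so an older part can still be returned as long as its group was active recently. Filtering rows by date before ranking, as in the query above, would drop those older parts instead.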
I wrote the query below to pull data from different databases. I created two temp tables to pull the data from two different databases, and finally a select statement against the original database that joins all the tables. The query executes but returns no data (the report is blank). I tried executing the two temp table queries separately, and each returns the correct data. But when I execute the whole query, the result is blank. Below is the query. Please help.
"set fmtonly off
use GODSDB
IF object_id('tempdb..#CISIS_Call_Log') IS NOT NULL DROP TABLE #CISIS_Call_Log
select *
into #CISIS_Call_Log
from OPENQUERY (CSISDB,
'select
ccl.ContractOID,
ccl.db_insertdate,
ccl.ContractCallLogStatusIdentifier,
ccl.db_UpdateDate,
ccp.ContractCallLogPurposeOID,
ccp.ContractCallLogPurposeIdentifier,
ccp.Description
from csisdb.dbo.ContractCallLog CCL
inner join csisdb.dbo.ContractCallLogPurpose CCP on ccl.ContractCallLogPurposeIdentifier = ccp.ContractCallLogPurposeIdentifier
where JurisdictionShortIdentifier = ''ON''
AND ContractCallLogStatusIdentifier IN (''DNR'', ''NR'')
')
IF object_id('tempdb..#CMS_Campaign') IS NOT NULL DROP TABLE #CMS_Campaign
select *
into #CMS_Campaign
from OPENQUERY (BA_GBASSTOCMS, '
Select
SystemSourceIdentifier,
ContractOID,
OfferSentDate,
CampaignOfferTypeIdentifier,
CampaignContractStatusIdentifier,
CampaignContractStatusUpdateDate,
DeclineDate,
CampaignOfferOID,
CampaignOID,
CampaignStartDate,
CampaignEndDate,
Jurisdiction,
CampaignDescription
from CMS.dbo.vw_CampaignInfo
where Jurisdiction = ''ON''
and CampaignOfferTypeIdentifier = ''REN''
')
select mp.CommodityTypeIdentifier as Commodity
,c.RtlrContractIdentifier as ContractID
,cs.ContractStatusIdentifier as ContractStatus
,c.SigningDate
,cf.StartDate as FlowStartDate
,cf.EndDate as FlowEndDate
,datediff(day, getdate(), c.RenewalDate) as RemainingDays
,c.RenewalDate
,l.ContractCallLogStatusIdentifier as CallLogType
,Substring (l.Description, 1, 20) as CallPurpose
,l.db_insertDate as CallLogDate
,cms.CampaignOfferOID as OfferID
,cms.CampaignContractStatusIdentifier as OfferStatus
,cms.CampaignContractStatusUpdateDate as StatusChangeDate
,cms.DeclineDate
from Contract c
inner join contractstate cs on cs.contractoid = c.ContractOID
and cs.ContractStatusIdentifier in ('ERA', 'FLW')
and datediff(day, getdate(), c.RenewalDate) > 60
inner join SiteIdentification si on si.SiteOID = c.SiteOID
inner join MarketParticipant mp on mp.MarketParticipantOID = si.MarketParticipantOID
inner join Market m on m.MarketOID = mp.MarketOID
inner join Jurisdiction j on j.JurisdictionOID = m.JurisdictionOID
and j.CountryCode = 'CA'
and j.ProvinceOrStateCode = 'ON'
inner join ContractFlow cf on cf.ContractOID = c.ContractOID
inner join #CISIS_Call_Log l on convert(varchar(15), l.ContractOID) = c.RtlrContractIdentifier
inner join #CMS_Campaign cms on convert(varchar(15), cms.ContractOID) = c.RtlrContractIdentifier
set fmtonly on"
IF the data in each temp table is verified, then:
Try a smaller, less complex, query to test your temp tables with. Also try them using a LEFT join as well e.g.:
select
c.RtlrContractIdentifier as ContractID
, c.SigningDate
, datediff(day, getdate(), c.RenewalDate) as RemainingDays
, c.RenewalDate
, l.ContractCallLogStatusIdentifier as CallLogType
, Substring (l.Description, 1, 20) as CallPurpose
, l.db_insertDate as CallLogDate
, cms.CampaignOfferOID as OfferID
, cms.CampaignContractStatusIdentifier as OfferStatus
, cms.CampaignContractStatusUpdateDate as StatusChangeDate
, cms.DeclineDate
from Contract c
LEFT join #CISIS_Call_Log l on convert(varchar(15), l.ContractOID) = c.RtlrContractIdentifier
LEFT join #CMS_Campaign cms on convert(varchar(15), cms.ContractOID) = c.RtlrContractIdentifier
Does this return data? Does it return data from both joined tables?
If neither temp table is returning data then those join conditions need to be changed.
If both temp tables do return data from that query, then try INNER joins. If that still works, then add back more joins (one at a time) until you identify the join that causes the overall fault.
Without data for every table it just isn't possible for us to pinpoint the exact reason for an empty result. Only you can, so you need to troubleshoot the problem one step at a time.
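The one-join-at-a-time procedure can be automated roughly as follows. This is only a sketch, using SQLite through Python's sqlite3 module so it runs on its own; the tables `contract`, `call_log` and `campaign`, their columns and the helper `count_after_each_join` are hypothetical stand-ins, not the real schema.

```python
import sqlite3

def count_after_each_join(conn, base, joins):
    """Add the joins one at a time and record the row count after each,
    so the first join that drops the count to zero stands out.
    The SQL fragments are concatenated as-is: use trusted strings only."""
    sql = f"SELECT count(*) FROM {base}"
    counts = [("base", conn.execute(sql).fetchone()[0])]
    for name, clause in joins:
        sql += " " + clause
        counts.append((name, conn.execute(sql).fetchone()[0]))
    return counts

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contract (contract_id TEXT);
    CREATE TABLE call_log (contract_oid INTEGER);
    CREATE TABLE campaign (contract_oid INTEGER);
    INSERT INTO contract VALUES ('101'), ('102');
    INSERT INTO call_log VALUES (101), (102);
    INSERT INTO campaign VALUES (999);   -- no matching contract
""")
counts = count_after_each_join(conn, "contract c", [
    ("call_log", "INNER JOIN call_log l ON CAST(l.contract_oid AS TEXT) = c.contract_id"),
    ("campaign", "INNER JOIN campaign m ON CAST(m.contract_oid AS TEXT) = c.contract_id"),
])
```

In this made-up data the count stays at 2 after the first join and falls to 0 after the second, which points at the `campaign` join condition as the one to inspect.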
I’m using SQL Server 2008
I have joins written something like the following, where the first join is encapsulated in a ‘With as’ statement so that I can name the output table as ‘A’ and then reference the ‘A’ resulting table in the next select and Join seen beneath it.
This works perfectly fine. What I would like to do then is reference that second table for another select statement and join, but when I try to wrap it in a ‘With as’ statement as well, the editor does not accept it as legitimate syntax for the second instance of 'With as'.
How can I subset resulting tables to reference in further select and join statements? I do not have permission to write to the database, so I can not create permanent tables in the database.
Thank you.
With A as
(
SELECT POL.[COMPANY_CODE]
,POL.[POLICY_NUMBER]
,POL.[STATUS_CODE]
,POL.ORIG_CLIENT_NUM
,TA.LINE
FROM [SamsReporting].[dbo].[POLICY] POL
Left join [SamsReporting].[dbo].[Transact] TA
ON TA.POLICY_NUMBER = POL.POLICY_NUMBER and TA.BASE_Account = 'B'
)
Select PM.POLICY_NUMBER
,A.[COMPANY_CODE]
,A.[POLICY_NUMBER]
,A.[Policy Status]
,eApp.SourceCode
From A
Left Join Web.dbo.Pmetrics PM on A.POLICY_NUMBER=PM.POLICY_NUMBER
Left Outer Join DDP.pol.eAppStaging eApp
on A.POLICY_NUMBER=eApp.PolicyNumber
where eApp.SourceCode = 'HAQ' or eApp.SourceCode = 'PLS'
Common Table Expressions (CTEs) can build upon each other as you would like. For example, you can do this:
WITH CTE1 AS (SELECT * FROM Table1)
, CTE2 AS (SELECT * FROM CTE1)
, CTE3 AS (SELECT * FROM CTE2)
SELECT * FROM CTE3;
You only need the WITH keyword for the first CTE. After that, separate each further CTE with a comma and just give its name, as in my example.
Hope that helps,
Ash
Sounds like a syntax issue to me. Google CTE (Common Table Expression) and review some examples of how they are formed.
With A as
(SELECT POL.[COMPANY_CODE]
,POL.[POLICY_NUMBER]
,POL.[STATUS_CODE]
,POL.ORIG_CLIENT_NUM
,TA.LINE
FROM [SamsReporting].[dbo].[POLICY] POL
Left join [SamsReporting].[dbo].[Transact] TA
ON TA.POLICY_NUMBER = POL.POLICY_NUMBER and TA.BASE_Account = 'B'),
B as (
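Select PM.POLICY_NUMBER AS PM_POLICY_NUMBER -- aliased: a CTE cannot expose two columns with the same name
,A.[COMPANY_CODE]
,A.[POLICY_NUMBER]
,A.[STATUS_CODE] -- A exposes STATUS_CODE, not [Policy Status]
,eApp.SourceCode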
,eApp.SourceCode
From A
Left Join Web.dbo.Pmetrics PM on A.POLICY_NUMBER=PM.POLICY_NUMBER
Left Outer Join DDP.pol.eAppStaging eApp
on A.POLICY_NUMBER=eApp.PolicyNumber
where eApp.SourceCode = 'HAQ' or eApp.SourceCode = 'PLS')
Select *
From B -- inner join some table
where some condition = 1
I am working on a project where I need to synchronize data from our system to an external system. What I want to achieve, is to periodically send only changed items (rows) from a custom query. This query looks like this (but with many more columns) :
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk
I want to avoid comparing every field one by one between synchronizations. I came up with the idea of generating a hash for every row from my query and comparing it with the hash from the previous synchronization, which would return only the changed rows. I am aware of the CHECKSUM function, but it is very collision-prone and might sometimes miss changes. However, I like the way I could just make a temp table and use CHECKSUM(*), which makes maintenance easier (no need to add fields both in the query and in the CHECKSUM):
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
INTO #tmp
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk;
-- get all columns from the query, plus a hash of the row
SELECT *, CHECKSUM(*)
FROM #tmp;
I am aware of the HASHBYTES function (which supports SHA-1 and MD5, both far less collision-prone than CHECKSUM), but it only accepts a varchar or varbinary value, not a list of columns or * the way CHECKSUM does. Having to cast/convert every column from the query is a pain in the ... and opens the door to errors (forgetting to include a new field, for instance).
I also looked at the Change Data Capture and Change Tracking features of SQL Server, but both seem complicated and overkill for what I am doing.
So my question: is there another method to generate a hash from a query or a temp table that meets my criteria?
If not, is there another way to achieve this kind of work (syncing differences from a query)?
I found a way to do exactly what I wanted, thanks to the FOR XML clause :
SELECT T1.field1,
T1.field2,
T1.field2,
T1.field3,
CASE WHEN T1.field4 = 'some-value' THEN 1 ELSE 0 END,
T2.field1,
T3.field1,
T4.field1
INTO #tmp
FROM T1
INNER JOIN T2 ON T2.pk = T2.fk
INNER JOIN T3 ON T3.pk = T2.fk
INNER JOIN T4 ON T4.pk = T2.fk;
-- get all columns from the query, plus a hash of the row (converted to a hex string)
SELECT T.*, CONVERT(VARCHAR(100), HASHBYTES('sha1', (SELECT T.* FOR XML RAW)), 2) AS sHash
FROM #tmp AS T;
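The same idea, hashing a serialized form of the whole row so that newly added columns are picked up automatically, can be sketched outside the database. The snippet below is illustrative only: it uses Python's hashlib and json rather than HASHBYTES and FOR XML, and the names `row_hash` and `changed_keys` as well as the sample rows are assumptions of mine.

```python
import hashlib
import json

def row_hash(row):
    """SHA-1 hex digest of a canonical JSON serialization of the row (a dict).
    sort_keys makes the result independent of column order."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

def changed_keys(previous_hashes, rows, key):
    """Return the keys of rows that are new or whose hash differs
    from the one stored at the previous synchronization."""
    return [r[key] for r in rows if previous_hashes.get(r[key]) != row_hash(r)]

old_rows = [{"id": 1, "name": "a", "flag": 0}, {"id": 2, "name": "b", "flag": 1}]
new_rows = [{"id": 1, "name": "a", "flag": 0},   # unchanged
            {"id": 2, "name": "b", "flag": 0},   # flag changed
            {"id": 3, "name": "c", "flag": 1}]   # new row

previous = {r["id"]: row_hash(r) for r in old_rows}
changed = changed_keys(previous, new_rows, "id")
```

Rows deleted since the previous synchronization do not show up this way; they would need a separate comparison of the key sets.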
I'm pulling my hair out over a subquery that I'm using to avoid about 100 duplicates (out of about 40k records). The records that are duplicated are showing up because they have 2 dates in h2.datecreated for a valid reason, so I can't just scrub the data.
I'm trying to get only the earliest date to return. The first subquery (the one that starts with "select distinct address_id", with the MIN) works fine on its own: no duplicates are returned. So it would seem that the left join (or just a plain join, I've tried that too) couldn't possibly see the second h2.datecreated, since it doesn't even show up in the subquery. But when I run the whole query, it returns 2 values for some ipc.mfgid values, one with the h2.datecreated that I want, and one that I don't want.
I know it's got to be something really simple, or something that just isn't possible. It really seems like it should work! This is MSSQL. Thanks!
select distinct ipc.mfgid as IPC, h2.datecreated,
case when ad.Address is null
then ad.buildingname end as Address, cast(trace.name as varchar)
+ '-' + cast(trace.Number as varchar) as ONT,
c.ACCOUNT_Id,
case when h.datecreated is not null then h.datecreated
else h2.datecreated end as Install
from equipmentjoin as ipc
left join historyjoin as h on ipc.id = h.EQUIPMENT_Id
and h.type like 'add'
left join circuitjoin as c on ipc.ADDRESS_Id = c.ADDRESS_Id
and c.GRADE_Code like '%hpna%'
join (select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment)
as h2 on c.address_id = h2.address_id
left join (select car.id, infport.name, carport.number, car.PCIRCUITGROUP_Id
from circuit as car (NOLOCK)
join port as carport (NOLOCK) on car.id = carport.CIRCUIT_Id
and carport.name like 'lead%'
and car.GRADE_Id = 29
join circuit as inf (NOLOCK) on car.CCIRCUITGROUP_Id = inf.PCIRCUITGROUP_Id
join port as infport (NOLOCK) on inf.id = infport.CIRCUIT_Id
and infport.name like '%olt%' )
as trace on c.ccircuitgroup_id = trace.pcircuitgroup_id
join addressjoin as ad (NOLOCK) on ipc.address_id = ad.id
The typical approach to only getting the lowest row is one of the following. You didn't bother to specify what version of SQL Server you're using, what you want to do with ties, and I have little interest to try to work this into your complex query, so I'll show you an abstract simplification for different versions.
SQL Server 2000
SELECT x.grouping_column, x.min_column, x.other_columns ...
FROM dbo.foo AS x
INNER JOIN
(
SELECT grouping_column, min_column = MIN(min_column)
FROM dbo.foo GROUP BY grouping_column
) AS y
ON x.grouping_column = y.grouping_column
AND x.min_column = y.min_column;
SQL Server 2005+
;WITH x AS
(
SELECT grouping_column, min_column, other_columns,
rn = ROW_NUMBER() OVER (PARTITION BY grouping_column ORDER BY min_column)
FROM dbo.foo
)
SELECT grouping_column, min_column, other_columns
FROM x
WHERE rn = 1;
This subquery:
select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment
will probably return multiple rows per address, because the comment is not guaranteed to be the same and it is part of the GROUP BY.
Try this instead:
CROSS APPLY (
SELECT TOP 1 H2.DateCreated, H2.Comment -- H2.Equipment_id wasn't used
FROM History H2
WHERE
H2.Comment LIKE 'MAC: 5%'
AND C.Address_ID = H2.Address_ID
ORDER BY DateCreated
) H2
Switch that to OUTER APPLY in case you want rows that don't have a matching desired history entry.
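Why grouping by the comment produces the duplicates can be reproduced on a tiny data set. The sketch below uses SQLite through Python's sqlite3 module, with an invented `history` table and invented rows. SQLite has no CROSS APPLY, so the "earliest row per address" step is written with row_number() instead (SQLite 3.25 or later); it is a stand-in for the TOP 1 ... ORDER BY pattern, not the same syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (address_id INTEGER, datecreated TEXT, comment TEXT);
    INSERT INTO history VALUES
        (1, '2012-01-05', 'MAC: 5A'),
        (1, '2012-02-09', 'MAC: 5B'),
        (2, '2012-03-01', 'MAC: 5C');
""")

# Grouping by comment keeps one row per distinct comment,
# so address 1 appears twice even though MIN() is used.
with_comment = conn.execute("""
    SELECT address_id, min(datecreated), comment
    FROM history
    WHERE comment LIKE 'MAC: 5%'
    GROUP BY address_id, comment
    ORDER BY address_id, min(datecreated)
""").fetchall()

# Ranking rows within each address and keeping rank 1
# returns only the earliest row, along with its own comment.
earliest = conn.execute("""
    SELECT address_id, datecreated, comment
    FROM (
        SELECT address_id, datecreated, comment,
               row_number() OVER (PARTITION BY address_id
                                  ORDER BY datecreated) AS rn
        FROM history
        WHERE comment LIKE 'MAC: 5%'
    ) AS ranked
    WHERE rn = 1
    ORDER BY address_id
""").fetchall()
```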
I tried everything but I couldn't overcome this problem.
I have a table-valued function.
When I call this function with
SELECT * FROM Ratings o1
CROSS APPLY dbo.FN_RatingSimilarity(50, 497664, 'Cosine') o2
WHERE o1.trackId = 497664
It takes a while to be executed. But when I do this.
SELECT * FROM Ratings o1
CROSS APPLY dbo.FN_RatingSimilarity(50, o1.trackId, 'Cosine') o2
WHERE o1.trackId = 497664
It is executed in 32 seconds. I created all indexes but It didn't help.
My function by the way:
ALTER FUNCTION [dbo].[FN_RatingSimilarity]
(
@trackId INT,
@nTrackId INT,
@measureType VARCHAR(100)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT o2.id,
o2.name,
o2.releaseDate,
o2.numberOfRatings,
o2.averageRating,
COUNT(1) as numberOfSharedUsers,
CASE @measureType
WHEN 'Cosine' THEN SUM(o3.score*o4.score)/(0.01+SQRT(SUM(POWER(o3.score,2))) * SQRT(SUM(POWER(o4.score,2))))
WHEN 'AdjustedCosine' THEN SUM((o3.score-o5.averageRating)*(o4.score-o5.averageRating))/(0.01+SQRT(SUM(POWER(o3.score-o5.averageRating, 2)))*SQRT(SUM(POWER(o4.score-o5.averageRating, 2))))
WHEN 'Pearson' THEN SUM((o3.score-o1.averageRating)*(o4.score-o2.averageRating))/(0.01+SQRT(SUM(POWER(o3.score-o1.averageRating, 2)))*SQRT(SUM(POWER(o4.score-o2.averageRating, 2))))
END as similarityRatio
FROM dbo.Tracks o1
INNER JOIN dbo.Tracks o2 ON o2.id != @trackId
INNER JOIN dbo.Ratings o3 ON o3.trackId = o1.id
INNER JOIN dbo.Ratings o4 ON o4.trackId = o2.id AND o4.userId = o3.userId
INNER JOIN dbo.Users o5 ON o5.id = o4.userId
WHERE o1.id = @trackId
AND o2.id = ISNULL(@nTrackId, o2.id)
GROUP BY o2.id,
o2.name,
o2.releaseDate,
o2.numberOfRatings,
o2.averageRating
)
Any help would be appreciated.
Thanks.
Emrah
I believe that your bottleneck is the calculations plus your very expensive inner joins.
The way you are joining basically creates a cross join: it returns a result set with all the records linked to all other records, except the one for which the id is supplied. Then you add to that result set with the other inner joins.
For every inner join, SQL Server creates a result set with all the matching rows.
So the first thing you do in your query is tell SQL Server to basically do a cross join on the same table. (I am assuming you are still following; the query looks pretty advanced, so I'll take it that you are familiar with advanced SQL syntax and operators.)
In the next inner join, you apply the Ratings table to that newly created huge result set, and only then filter out the rows that are not in both tables.
So as a start, see if you can do your joins the other way around. (This really depends on your table record counts and record sizes.) Try to have the smallest result sets first and then join onto them.
The second thing you might want to try is to limit your result set before the joins. Start with a CTE where you filter for o1.id = @trackId. Then select from this CTE, do your joins on the CTE, and filter in your query for o2.id = ISNULL(@nTrackId, o2.id).
I will work on an example, stay tuned...
--
Ok, I added an example, did a quick test and the values returned are the same. Run this through your data and let us know if there is any improvement. (Note, this does not address the INNER JOIN order point discussed, still do play around with that.)
Example:
CREATE FUNCTION [dbo].[FN_RatingSimilarity_NEW] -- CREATE, since the _NEW function does not exist yet
(
@trackId INT,
@nTrackId INT,
@measureType VARCHAR(100)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
WITH CTE_ALL AS
(
-- restrict Tracks to the single source track before any join
SELECT id,
name,
releaseDate,
numberOfRatings,
averageRating
FROM dbo.Tracks
WHERE id = @trackId
)
SELECT o2.id,
o2.name,
o2.releaseDate,
o2.numberOfRatings,
o2.averageRating,
COUNT(1) as numberOfSharedUsers,
CASE @measureType
WHEN 'Cosine' THEN SUM(o3.score*o4.score)/(0.01+SQRT(SUM(POWER(o3.score,2))) * SQRT(SUM(POWER(o4.score,2))))
WHEN 'AdjustedCosine' THEN SUM((o3.score-o5.averageRating)*(o4.score-o5.averageRating))/(0.01+SQRT(SUM(POWER(o3.score-o5.averageRating, 2)))*SQRT(SUM(POWER(o4.score-o5.averageRating, 2))))
WHEN 'Pearson' THEN SUM((o3.score-o1.averageRating)*(o4.score-o2.averageRating))/(0.01+SQRT(SUM(POWER(o3.score-o1.averageRating, 2)))*SQRT(SUM(POWER(o4.score-o2.averageRating, 2))))
END as similarityRatio
FROM CTE_ALL o1
INNER JOIN dbo.Tracks o2 ON o2.id != @trackId
INNER JOIN dbo.Ratings o3 ON o3.trackId = o1.id
INNER JOIN dbo.Ratings o4 ON o4.trackId = o2.id AND o4.userId = o3.userId
INNER JOIN dbo.Users o5 ON o5.id = o4.userId
WHERE o2.id = ISNULL(@nTrackId, o2.id)
GROUP BY o2.id,
o2.name,
o2.releaseDate,
o2.numberOfRatings,
o2.averageRating
)
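When comparing the old and new functions, it can help to check a few similarity values by hand. The following is a small sketch of what the 'Cosine' branch computes for one pair of tracks, written in Python so it can be run without the database. The function name `cosine_similarity` and the sample scores are made up; the 0.01 term in the denominator is taken from the SQL above, and only users who rated both tracks are included, mirroring the join on userId.

```python
import math

def cosine_similarity(scores_a, scores_b):
    """Mirror of the 'Cosine' branch: sum(a*b) / (0.01 + |a| * |b|),
    computed over the users who rated both tracks.
    scores_a / scores_b map a user id to that user's score."""
    shared = scores_a.keys() & scores_b.keys()
    dot = sum(scores_a[u] * scores_b[u] for u in shared)
    norm_a = math.sqrt(sum(scores_a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(scores_b[u] ** 2 for u in shared))
    return dot / (0.01 + norm_a * norm_b)

track_a = {"u1": 3, "u2": 4, "u3": 5}   # u3 did not rate track_b
track_b = {"u1": 3, "u2": 4}
similarity = cosine_similarity(track_a, track_b)
```

For these scores the shared users are u1 and u2, so the value is 25 / 25.01, just under 1 because of the 0.01 guard against division by zero.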