Read a flat file containing an array in Azure Data Factory

I need to read a pipe-delimited file where we have an array repeating 30 times. I need to access these array elements, change their sequence, and send them in the output file.
E.g.
Tanya|1|Pen|2|Book|3|Eraser
Raj|11|Eraser|22|Bottle
In the above example, the first field is the customer name. After that we have an array of ordered items, each consisting of an order ID and an item name.
Could you please suggest how to read these array elements individually so they can be processed further?

You can use a Copy activity in an Azure Data Factory pipeline, with a DelimitedText dataset as the source and a JSON dataset as the sink. If your source and destination files are located in Azure Blob Storage, create a Linked Service to connect them to Azure Data Factory.
In the source dataset properties, set the Column delimiter to Pipe (|).
In the sink JSON dataset settings, just specify the output location path where the file will be saved after conversion.
In the Copy activity's sink tab, select the File pattern "Array of objects", then trigger the pipeline.
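The original answer illustrated the sample input and output with screenshots. Roughly, for the sample input above, the sink JSON would look like the following (assuming "First row as header" is left unchecked, so ADF generates its default Prop_N column names):
[
    {"Prop_0":"Tanya","Prop_1":"1","Prop_2":"Pen","Prop_3":"2","Prop_4":"Book","Prop_5":"3","Prop_6":"Eraser"},
    {"Prop_0":"Raj","Prop_1":"11","Prop_2":"Eraser","Prop_3":"22","Prop_4":"Bottle"}
]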

If you can control the output file format then you're better off using two (or three) delimiters: one for the person and another for the order items. If you can get a third, use it to split the order item from the order line.
Assuming the data format:
Person Name | 1:M [Order]
And each Order is
order line | item name
You can simply ingest the entire row into a single nvarchar(max) column and then use SQL to break out the data you need.
The following is one such example.
declare @tbl table (d nvarchar(max));
insert into @tbl values('Tanya|1|Pen|2|Book|3|Eraser'),('Raj|11|Eraser|22|Bottle');
declare @base table (id int, person varchar(100), total_orders int, raw_orders varchar(max));
declare @output table (id int, person varchar(100), item_id int, item varchar(100));
with a as
(
    select
        CHARINDEX('|',d) idx
        ,d
        ,ROW_NUMBER() over (order by d) as id /*Or newid()*/
    from @tbl
), b as
(
    select
        id
        ,SUBSTRING(d,0,idx) person
        ,SUBSTRING(d,idx+1,LEN(d)-idx+1) order_array
    from a
), c as
(
    select id, person, order_array
        ,(select count(1) from string_split(order_array,'|')) / 2 orders
    from b
)
insert into @base (id,person,total_orders,raw_orders)
select id,person,orders,order_array from c;

declare @total_persons int = (select count(1) from @base);
declare @person_enu int = 1;

while @person_enu <= @total_persons
BEGIN
    declare @total_orders int = (select total_orders from @base where id = @person_enu);
    declare @raw_orders nvarchar(max) = (select raw_orders from @base where id = @person_enu);
    declare @order_enu int = 1;
    declare @i int = 1;

    print CONCAT('Person ', @person_enu, '. Total orders: ', @total_orders);

    while @order_enu <= @total_orders
    begin
        -- string_split with the ordinal argument requires SQL Server 2022 or Azure SQL
        INSERT INTO @output (id,person,item_id,item)
        select b.id, b.person, n.value [item_id], v.value [item]
        from @base b
        cross apply string_split(b.raw_orders,'|',1) n
        cross apply string_split(b.raw_orders,'|',1) v
        where b.id = @person_enu and n.ordinal = @i and v.ordinal = @i + 1;

        set @order_enu += 1;
        set @i += 2;
    end

    set @person_enu += 1;
END

select * from @output;
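For the two sample rows, the final select should return something like the following (ROW_NUMBER() orders by the raw string, so Raj sorts before Tanya):
id person item_id item
1  Raj    11      Eraser
1  Raj    22      Bottle
2  Tanya  1       Pen
2  Tanya  2       Book
2  Tanya  3       Eraser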

Related

Searching for multiple patterns in a string in T-SQL

In T-SQL my dilemma is that I have to parse a potentially long string (up to 500 characters) for any of over 230 possible values and remove them from the string for reporting purposes. These values are stored in a column of another table; they're all upper case and 4 characters long, with the exception of two that are 5 characters long.
Examples of these values are:
USFRI
PROME
AZCH
TXJS
NYDS
XVIV ...
Example of string before:
"Offered to XVIV and USFRI as back ups. No response as of yet."
Example of string after:
"Offered to and as back ups. No response as of yet."
Pretty sure it will have to be a UDF, but I'm unable to come up with anything other than stripping ALL the upper-case characters out of the string with PATINDEX, which is not the objective.
This is unavoidably kludgy, but one way is to split your string into rows; once you have a set of words the rest is easy: simply re-aggregate while ignoring the matching values*:
with t as (
select 'Offered to XVIV and USFRI as back ups. No response as of yet.' s
union select 'Another row AZCH and TXJS words.'
), v as (
select * from (values('USFRI'),('PROME'),('AZCH'),('TXJS'),('NYDS'),('XVIV'))v(v)
)
select t.s OriginalString, s.Removed
from t
cross apply (
select String_Agg(j.[value], ' ') within group(order by Convert(tinyint,j.[key])) Removed
from OpenJson(Concat('["',replace(s, ' ', '","'),'"]')) j
where not exists (select * from v where v.v = j.[value])
)s;
* Requires a fully-supported version of SQL Server.
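For the two sample rows, the result would be something like:
OriginalString                                                   Removed
Another row AZCH and TXJS words.                                 Another row and words.
Offered to XVIV and USFRI as back ups. No response as of yet.    Offered to and as back ups. No response as of yet.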
Build a function to do the cleaning of one sentence, then call that function from your query, something like this: SELECT Col1, dbo.fn_ReplaceValue(Col1) AS cleanValue, * FROM MySentencesTable. Your fn_ReplaceValue will be something like the code below. You could also create the table variable outside the function and pass it in as a parameter to speed up the process, but this way it is all self-contained.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION fn_ReplaceValue(@sentence VARCHAR(500))
RETURNS VARCHAR(500)
AS
BEGIN
    DECLARE @ResultVar VARCHAR(500)
    DECLARE @allValues TABLE (rowID INT, sValues VARCHAR(15))
    DECLARE @id INT = 0
    DECLARE @ReplaceVal VARCHAR(10)
    DECLARE @numberOfValues INT = (SELECT COUNT(*) FROM MyValuesTable)

    --Populate table variable with all values
    INSERT INTO @allValues
    SELECT ROW_NUMBER() OVER(ORDER BY MyValuesCol) AS rowID, MyValuesCol
    FROM MyValuesTable

    SET @ResultVar = @sentence

    -- "<" rather than "<=": with "<=" the final lookup returns NULL and
    -- REPLACE(@ResultVar, NULL, ...) would wipe out the whole result
    WHILE (@id < @numberOfValues)
    BEGIN
        SET @id = @id + 1
        SET @ReplaceVal = (SELECT sValues FROM @allValues WHERE rowID = @id)
        SET @ResultVar = REPLACE(@ResultVar, @ReplaceVal, SPACE(0))
    END

    RETURN @ResultVar
END
GO
I suggest creating a table (either temporary or permanent), and loading these 230 string values into this table. Then use it in the following delete:
DELETE
FROM yourTable
WHERE col IN (SELECT col FROM tempTable);
If you just want to view your data sans these values, then use:
SELECT *
FROM yourTable
WHERE col NOT IN (SELECT col FROM tempTable);

How to pass string with single quotations to in()

I am working on a query that I need to modify so that a string is passed to in(). The view is used by some other view and ultimately by a stored procedure. The string values must be wrapped in single quotes (' ').
select region, county, name
from vw_main
where state = 'MD'
and building_id in ('101', '102') -- pass the string into in()
The values for the building_id will be entered at the stored procedure level upon its execution.
Please check the scripts below, which will give you the answer.
Way 1: Split CSV value using XML and directly use select query in where condition
DECLARE @StrBuildingIDs VARCHAR(1000)
SET @StrBuildingIDs = '101,102'
SELECT
    vm.region,
    vm.county,
    vm.name
FROM vw_main vm
WHERE vm.state = 'MD'
AND vm.building_id IN
(
    SELECT
        l.value('.','VARCHAR(20)') AS Building_Id
    FROM
    (
        SELECT CAST('<a>' + REPLACE(@StrBuildingIDs,',','</a><a>') + '</a>' AS XML) AS BuildIDXML
    ) x
    CROSS APPLY x.BuildIDXML.nodes('a') Split(l)
)
Way 2: Split CSV value using XML, Create Variable Table and use that in where condition
DECLARE @StrBuildingIDs VARCHAR(1000)
SET @StrBuildingIDs = '101,102'
DECLARE @TblBuildingID TABLE(BuildingId INT)
INSERT INTO @TblBuildingID(BuildingId)
SELECT
    l.value('.','VARCHAR(20)') AS Building_Id
FROM
(
    SELECT CAST('<a>' + REPLACE(@StrBuildingIDs,',','</a><a>') + '</a>' AS XML) AS BuildIDXML
) x
CROSS APPLY x.BuildIDXML.nodes('a') Split(l)
SELECT
    vm.region,
    vm.county,
    vm.name
FROM vw_main AS vm
WHERE vm.state = 'MD'
AND vm.building_id IN
(
    SELECT BuildingId
    FROM @TblBuildingID
)
Way 3: Split CSV value using XML, Create Variable Table and use that in INNER JOIN
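The original answer stops at this heading. A minimal sketch of what Way 3 would look like, reusing the @TblBuildingID table variable populated in Way 2:
SELECT
    vm.region,
    vm.county,
    vm.name
FROM vw_main AS vm
INNER JOIN @TblBuildingID AS b
    ON b.BuildingId = vm.building_id
WHERE vm.state = 'MD'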
Assuming the input string is not end-user input (that is, it is derived or pulled from another table or some other controlled source), you can do this:
DECLARE @instr nvarchar(max) = N'''a'',''b'',''c''';
declare @stmt nvarchar(4000) = N'
select region, county, name
from vw_main
where state = ''MD''
and building_id in ({instr})';
set @stmt = replace(@stmt, N'{instr}', @instr);
exec sp_executesql @stmt=@stmt;
If the input is from an end-user, this is safer:
declare @t table (a int, b char);
insert into @t (a, b) values (1,'A'), (2, 'B');
declare @str varchar(50) = 'A,B';
select t.*
from @t t
join (select * from string_split(@str, ',')) s(b)
    on t.b = s.b;
You may like it better anyway, since there's no dynamic SQL involved. However, you must be running SQL Server 2016 or higher.
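Adapted to the original query, the string_split approach would look roughly like this (assuming building_id is character-typed; if it is an int, compare against CAST(s.value AS INT) instead):
DECLARE @StrBuildingIDs varchar(1000) = '101,102';
SELECT vm.region, vm.county, vm.name
FROM vw_main vm
JOIN string_split(@StrBuildingIDs, ',') s
    ON vm.building_id = s.value
WHERE vm.state = 'MD';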

How to get 2 sub strings as 2 columns from single column of table

[{"key":"Mobile","value":"9100617634"},{"key":"Email","value":"balajirao1#ziaff.in"}]
The above one represents the value in one column of a table.
I want 2 columns named as mobile and email where those having values as 9000617634,balajirao#ziraff, respectively .
How to get in sql server.
see the below one.
mobile Email
------ ------------
9100617634 balajirao1#ziaff.in
You can use the PIVOT function to convert the row values to columns.
Below is static code for the implementation, assuming the key/value pairs are already available as rows:
SELECT * FROM
(
    SELECT [key], [value]
    FROM table_name
) M
PIVOT (MAX([value]) FOR [key] IN ([Mobile], [Email])) AS P;
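Since the column value is JSON, another option on SQL Server 2016+ is OPENJSON, which avoids manual string parsing; a minimal sketch against the sample value:
DECLARE @s nvarchar(max) = N'[{"key":"Mobile","value":"9100617634"},{"key":"Email","value":"balajirao1@ziaff.in"}]';
SELECT
    MAX(CASE WHEN [key] = 'Mobile' THEN [value] END) AS Mobile,
    MAX(CASE WHEN [key] = 'Email'  THEN [value] END) AS Email
FROM OPENJSON(@s)
WITH ([key] nvarchar(50) '$.key', [value] nvarchar(200) '$.value');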
Well that was fun doing it the hard way :)
--initial string
declare @s1 varchar(1000) = (select '[{"key":"Mobile","value":"9100617634"},{"key":"Email","value":"balajirao1@ziaff.in"}]')
--selecting first part and second part of string
declare @mobile varchar(100) = (select right(left(@s1,charindex('}',@s1)-1),len(left(@s1,charindex('}',@s1))) - charindex(':',left(@s1,charindex('}',@s1)))-1))
declare @mail varchar(100) = (select right(@s1,charindex(':',reverse(@s1))-1))
--getting rid of extra characters
set @mobile = (right(@mobile, len(@mobile) - charindex(':',@mobile)))
set @mail = (left(@mail, len(@mail) - charindex('}',reverse(@mail))))
--getting rid of double quotes
set @mobile = replace(@mobile,'"','')
set @mail = replace(@mail,'"','')
--selecting data
select
    @mobile as Mobile,
    @mail as Mail
The result is as follows:
Mobile Mail
9100617634 balajirao1@ziaff.in
SELECT
case when [MOBILE/PHONE] like '[a-z]%' then null else [MOBILE/PHONE] end as [MOBILE/PHONE],
case when ISNUMERIC(email)=0 then EMAIL end as EMAIL
FROM
(
select
REPLACE(SUBSTRING(RIGHT(right(left(AC.Communication,charindex('}',AC.Communication)),len(left(AC.Communication,charindex('}',AC.Communication))) - charindex(':',left(AC.Communication,charindex('}',AC.Communication)))),LEN(right(left(AC.Communication,charindex('}',AC.Communication)),len(left(AC.Communication,charindex('}',AC.Communication))) - charindex(':',left(AC.Communication,charindex('}',AC.Communication)))))-CHARINDEX(':',right(left(AC.Communication,charindex('}',AC.Communication)),len(left(AC.Communication,charindex('}',AC.Communication))) - charindex(':',left(AC.Communication,charindex('}',AC.Communication)))))),2,CHARINDEX(']',AC.Communication)),'"}','') AS [MOBILE/PHONE],
REPLACE(REPLACE(LEFT(right(AC.Communication,charindex(':',reverse(AC.Communication))),LEN(right(AC.Communication,charindex(':',reverse(AC.Communication))))-charindex('}',reverse(right(AC.Communication,charindex(':',reverse(AC.Communication)))))),':"',' '),'"','') AS EMAIL
from DimAccountContact as AC

With SQL Server, How can I query a table based on a delimited string as the criteria?

I have the following tables:
tbl_File:
FileID | Filename
-----------------
1 | test.jpg
and
tbl_Tag:
TagID | TagName
---------------
1 | Red
and
tbl_TagFile:
ID | TagID | FileID
-------------------
1 | 1 | 1
I need to pass a non-inclusive query against these tables. For example, imagine a list of checkboxes to select one or more tags, and then a search button. I need to pass the TagID's to the query as a PIPE delimited string, such as "1|2|5|"
The search results must meet all the criteria: if 3 tags are selected, the results should be only those files that have all 3 tags associated with them.
I think I've made this too complicated. I tried iterating over the tags using CHARINDEX and such to work my way through the string, but it seems there must be an easier way.
I'd like to do this as a function... Such as
SELECT FileID, Filename
FROM tbl_Files
WHERE dbo.udf_FileExistswithTags(@Tags, FileID) = 1
Any efficient way to do this?
It doesn't sound from your example scenario that the actual "need" is to pass a pipe-delimited string. I would highly suggest abandoning that idea and using a Table Value Parameter in your stored procedure. This has numerous advantages in that you will not hit a datatype limit or a "number of parameters" limit that might occur with very large sets of criteria. Additionally it gets away from any need to run a (potentially very slow) UDF.
Split the string into tokens on the application side, and then insert each token as a row in the TVP. Example below:
Create the TVP type in your database:
CREATE TYPE [dbo].[FileNameType] AS TABLE
(
fileName varchar(1000)
)
On the application side, build your list of filename tokens into a recordset:
private static List<SqlDataRecord> BuildFileNameTokenRecords(IEnumerable<string> tokens)
{
    var records = new List<SqlDataRecord>();
    foreach (string token in tokens)
    {
        var record = new SqlDataRecord(
            new SqlMetaData[]
            {
                // variable-length types need an explicit maximum length
                new SqlMetaData("fileName", SqlDbType.VarChar, 1000),
            }
        );
        record.SetString(0, token); // populate the row; the original sketch omitted this step
        records.Add(record);
    }
    return records;
}
Wherever you run your proc from (rough code here):
var records = BuildFileNameTokenRecords(listofstrings);
var sqlCmd = sqlDb.GetStoredProcCommand("FileExists");
sqlDb.AddInParameter(sqlCmd, "tvpFilenameTokens", SqlDbType.Structured, records);
ExecuteNonQuery(sqlCmd);
Filtering your select statement then simply becomes a matter of joining on the tokens in the table parameter. Something like this:
CREATE PROCEDURE dbo.FileExists
(
    -- Put additional parameters here
    @tvpFilenameTokens dbo.FileNameType READONLY
)
AS
BEGIN
    SELECT FileID, Filename
    FROM tbl_Files
    INNER JOIN @tvpFilenameTokens
        ON tbl_Files.Filename = @tvpFilenameTokens.fileName
END
Here is an option that should scale. All of the functionality is available back to SQL Server 2005. It uses a CTE to separate the portion of the query that finds only the FileIDs that have all of the TagIDs passed in, and then that list of FileIDs is joined to the [File] table to get the details. It also uses an INNER JOIN instead of an IN list to match the TagID's.
Please note that the example below uses a SQLCLR splitter that is freely available in the SQL# library (which I wrote, but this function is in the Free version). The specific splitter used is not the important part; it should just be one that is either SQLCLR, an inline tally table (like the one used in @wewesthemenace's answer), or the XML method. Just don't use a splitter based on a WHILE loop or a recursive CTE.
---- TEST SETUP
DECLARE @File TABLE
(
    FileID INT NOT NULL PRIMARY KEY,
    [Filename] NVARCHAR(200) NOT NULL
);
DECLARE @TagFile TABLE
(
    TagID INT NOT NULL,
    FileID INT NOT NULL,
    PRIMARY KEY (TagID, FileID)
);
INSERT INTO @File VALUES (1, 'File1.txt');
INSERT INTO @File VALUES (2, 'File2.txt');
INSERT INTO @File VALUES (3, 'File3.txt');
INSERT INTO @TagFile VALUES (1, 1);
INSERT INTO @TagFile VALUES (2, 1);
INSERT INTO @TagFile VALUES (5, 1);
INSERT INTO @TagFile VALUES (1, 2);
INSERT INTO @TagFile VALUES (2, 2);
INSERT INTO @TagFile VALUES (4, 2);
INSERT INTO @TagFile VALUES (1, 3);
INSERT INTO @TagFile VALUES (2, 3);
INSERT INTO @TagFile VALUES (5, 3);
INSERT INTO @TagFile VALUES (6, 3);
---- DONE WITH TEST SETUP
DECLARE @TagsToGet VARCHAR(100); -- this would be the proc input parameter
SET @TagsToGet = '1|2|5';
CREATE TABLE #Tags (TagID INT NOT NULL PRIMARY KEY);
DECLARE @NumTags INT;
INSERT INTO #Tags (TagID)
    SELECT split.SplitVal
    FROM SQL#.String_Split4k(@TagsToGet, '|', 1) split;
SET @NumTags = @@ROWCOUNT;
;WITH files AS
(
    SELECT tf.FileID
    FROM @TagFile tf
    INNER JOIN #Tags tg
        ON tg.TagID = tf.TagID
    GROUP BY tf.FileID
    HAVING COUNT(*) = @NumTags
)
SELECT fl.*
FROM @File fl
INNER JOIN files
    ON files.FileID = fl.FileID
ORDER BY fl.[Filename] ASC;
DROP TABLE #Tags; -- don't need this if code above is placed in a proc
Results:
FileID Filename
1 File1.txt
3 File3.txt
Notes
As much as I love TVPs (and I do, when they are done correctly and used appropriately), I would say that they are a bit much for this type of small scale, single dimensional array scenario. There won't really be any performance gain over using a SQLCLR streaming TVF string splitter but it would require more app code and the additional User-Defined Table Type, which can't be updated without first dropping all procs that reference it. That doesn't happen all of the time, but needs to be considered in terms of long-term maintenance costs.
The JOIN between TagFile and the temporary table populated from the split operation should be much more efficient than using an IN list with a subquery for the split operation. An IN list is short-hand for all of the values in it to be their own OR conditions. Hence the JOIN is a fully set-based approach that lets the Query Optimizer do its thang.
The structure I used for the test @TagFile table only has the two relevant IDs in it: TagID and FileID. It does not have the ID field that I assume is an IDENTITY field on this table. Unless there is a very specific reason for needing that IDENTITY field, I would suggest removing it. It adds no inherent benefit, as the combination of TagID and FileID is a natural key (i.e. it is both NOT NULL and unique). And if the Clustered PK of this table were simply those two fields, the JOIN to the temp table of those split-out TagIDs would be quite fast, even with millions of rows in TagFile.
One reason that this approach works so much better than trying to handle this via a function per FileID (outside of the obvious set-based is better than cursor-based reason) is that the list of TagIDs is the same for all files to be checked. So splitting that out more than one time is a waste of effort.
By not splitting the TagID list inline in the query I am able to capture the number of elements in that list with no additional effort. Hence this saves from needing to do a secondary calculation.
Here is a function called DelimitedSplit8K by Jeff Moden. This is used to split strings of length up to 8000. For more info, read this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
CREATE FUNCTION [dbo].[DelimitedSplit8K](
    @pString VARCHAR(8000), --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
    @pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (--10E+1 or 10 rows
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (
    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
    SELECT 1 UNION ALL
    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
),
cteLen(N1, L1) AS(--==== Return start and length (for use in substring)
    SELECT
        s.N1,
        ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000)
    FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
    ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
    Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
Your query would now be:
DECLARE @pString VARCHAR(8000) = '1|3|5'
SELECT
    f.*
FROM tbl_File f
INNER JOIN tbl_TagFile tf ON tf.FileID = f.FileID
WHERE
    tf.TagID IN (SELECT CAST(item AS INT) FROM dbo.DelimitedSplit8K(@pString, '|'))
GROUP BY f.FileID, f.FileName
HAVING COUNT(tf.ID) = (LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
The expression below counts the number of TagIDs in the parameter by counting the occurrences of the delimiter | and adding 1.
(LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
Here is an option that does not require UDF's.
It can be argued that this is also complicated.
DECLARE @TagList VARCHAR(50)
-- pass in this
SET @TagList = '1|3|6'
SELECT
FinalSet.FileID,
FinalSet.Tag,
FinalSet.TotalMatches
FROM
(
SELECT
tbl_TagFile.FileID,
tbl_TagFile.Tag,
COUNT(*) OVER(PARTITION BY tbl_TagFile.FileID) TotalMatches
FROM
(
SELECT 1 FileID, '1' Tag UNION ALL
SELECT 1 , '2' UNION ALL
SELECT 1 , '3' UNION ALL
SELECT 1 , '6' UNION ALL
SELECT 2 , '1' UNION ALL
SELECT 2 , '3'
) tbl_TagFile
INNER JOIN
(
SELECT tbl_Tag.Tag
FROM
(
SELECT '1' Tag UNION ALL
SELECT '2' UNION ALL
SELECT '3' UNION ALL
SELECT '4' UNION ALL
SELECT '5' UNION ALL
SELECT '6'
) tbl_Tag
WHERE '|' + @TagList + '|' LIKE '%|' + Tag + '|%'
) LimitedTagTable
ON LimitedTagTable.Tag = tbl_TagFile.Tag
) FinalSet
WHERE
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
There are some complications in this around data types and indexes and such, but you can see the concept: you are only getting the records that match your passed-in string.
subtable LimitedTagTable is your tag list filtered by your input pipe delimited string
subtable FinalSet joins your limited tag list to your list of files
column TotalMatches works out how many tag matches your file had
Finally this line limits the output to those files that had enough matches:
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
Please experiment with different inputs and datasets and see if it suits as I have made a number of assumptions.
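With the inline test data above, @TagList = '1|3|6' requires three matches, so only FileID 1 qualifies and the output should look something like:
FileID Tag TotalMatches
1      1   3
1      3   3
1      6   3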
I'm answering my own question, in hopes that someone can let me know if/how flawed it is. So far it seems to be working, but this is just early testing.
Function:
ALTER FUNCTION [dbo].[udf_FileExistsByTags]
(
    @FileID int
    ,@Tags nvarchar(max)
)
RETURNS bit
AS
BEGIN
    DECLARE @Exists bit = 0
    DECLARE @Count int = 0
    DECLARE @TagTable TABLE ( FileID int, TagID int )
    DECLARE @Tag int
    WHILE len(@Tags) > 0
    BEGIN
        SET @Tag = CAST(LEFT(@Tags, charindex('|', @Tags + '|') - 1) as int)
        SET @Count = @Count + 1
        IF EXISTS (SELECT * FROM tbl_FileTag WHERE FileID = @FileID AND TagID = @Tag)
        BEGIN
            INSERT INTO @TagTable ( FileID, TagID ) VALUES ( @FileID, @Tag )
        END
        SET @Tags = STUFF(@Tags, 1, charindex('|', @Tags + '|'), '')
    END
    SET @Exists = CASE WHEN @Count = (SELECT COUNT(*) FROM @TagTable) THEN 1 ELSE 0 END
    RETURN @Exists
END
Then in the query:
SELECT * FROM tbl_File a WHERE dbo.udf_FileExistsByTags(a.FileID, @Tags) = 1
So now I'm looking for errors.
What do you think? Probably not very efficient, but this search will be used only on a periodic basis.

Execute multiple dynamic T-SQL statements and obtain a limited number of unique values while preserving order

I have a SourceTable and a table variable @TQueries containing various T-SQL predicates that target SourceTable.
The expected result is to dynamically generate SELECT statements that return a list of Id's as specified by the predicates in @TQueries. Each dynamically generated SELECT statement also needs to execute in a particular order, and the final set of values needs to be unique and the ordering must be preserved.
Fortunately, there's a limit to how many values need to be retrieved and how many dynamic queries need to be generated. The Id list should contain at most 10 Ids, and we don't expect more than 7 queries.
The following is a sample of this setup, not the actual data/database:
-- Set up some test data, this is quick and dirty just to provide some data to test against
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SourceTable]') AND type in (N'U'))
BEGIN
-- Create a numbers table, sorta
SELECT TOP 20
IDENTITY(INT,1,1) AS Id,
ABS(CHECKSUM(NewId())) % 100 AS [SomeValue]
INTO [SourceTable]
FROM sysobjects a
END
DECLARE @TQueries TABLE (
[Ordinal] INT,
[WherePredicate] NVARCHAR(MAX),
[OrderByPredicate] NVARCHAR(MAX)
);
-- Simulate SELECTs with different order by that get different data due to varying WHERE clauses and ORDER conditions
INSERT INTO @TQueries VALUES ( 1, N'[Id] IN (6,11,13,7,10,3,15)', '[SomeValue] ASC' ) -- Sort Asc
INSERT INTO @TQueries VALUES ( 2, N'[Id] IN (9,15,14,20,17)', '[SomeValue] DESC' ) -- Sort Desc
INSERT INTO @TQueries VALUES ( 3, N'[Id] IN (20,10,1,16,11,19,9,15,17,6,2,3,13)', 'NEWID()' ) -- Sort Random
My main issue has been avoiding the use of a CURSOR or iterating through the rows one by one. The closest I've come to a set operation that meets these criteria is using a table variable to store the results of each query, or a massive CTE.
Suggestions and comments are welcome.
Here's a solution that builds a single statement both to run all the queries and to return the results.
It uses a similar approach as in your answer when iterating over the @TQueries table, i.e. it also uses {...} tokens where column values from @TQueries should go, and it puts the values there with nested REPLACE() calls.
Other than that, it heavily depends on ranking functions, and I'm not sure it doesn't abuse them. You'd need to test this method before deciding whether it's better or worse than the one you've got so far.
DECLARE @QueryTemplate nvarchar(max), @FinalSQL nvarchar(max);
SET @QueryTemplate =
N'SELECT
[Id],
QueryRank = {Ordinal},
RowRank = ROW_NUMBER() OVER (ORDER BY {OrderByPredicate})
FROM [dbo].[SourceTable]
WHERE {WherePredicate}
';
SET @FinalSQL =
N'WITH AllData AS (
' +
SUBSTRING(
(
SELECT
'UNION ALL ' +
REPLACE(REPLACE(REPLACE(@QueryTemplate,
'{Ordinal}' , CAST([Ordinal] AS nvarchar(11))), -- CAST needed: REPLACE rejects int arguments
'{OrderByPredicate}', [OrderByPredicate]),
'{WherePredicate}' , [WherePredicate] )
FROM @TQueries
ORDER BY [Ordinal]
FOR XML PATH (''), TYPE
).value('.', 'nvarchar(max)'),
11, -- starting just after the first 'UNION ALL '
CAST(0x7FFFFFFF AS int) -- max int; no need to specify the exact length
) +
'),
RankedData AS (
SELECT
[Id],
QueryRank,
RowRank,
ValueRank = ROW_NUMBER() OVER (PARTITION BY [Id] ORDER BY QueryRank)
FROM AllData
)SELECT TOP (@top)
[Id]
FROM RankedData
WHERE ValueRank = 1
ORDER BY
QueryRank,
RowRank
';
PRINT @FinalSQL;
EXECUTE sp_executesql @FinalSQL, N'@top int', 10;
Basically, every subquery gets these auxiliary columns:
QueryRank – a constant value (within the subquery's result set) derived from [Ordinal];
RowRank – a ranking assigned to a row based on the [OrderByPredicate].
The result sets are UNIONed and then every entry of every unique value is again ranked (ValueRank) based on the query ranking.
When pulling the final result set, duplicates are suppressed (by the condition ValueRank = 1), and QueryRank and RowRank are used in the ORDER BY clause to preserve the original row order.
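For illustration, the subquery generated for the first row of @TQueries (and printed as part of @FinalSQL) would look roughly like this:
SELECT
    [Id],
    QueryRank = 1,
    RowRank = ROW_NUMBER() OVER (ORDER BY [SomeValue] ASC)
FROM [dbo].[SourceTable]
WHERE [Id] IN (6,11,13,7,10,3,15)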
I used EXECUTE sp_executesql @query instead of EXECUTE (@query), because the former allows you to add parameters to the query. In particular, I parametrised the number of results to return (the argument of TOP). But you could certainly concatenate that value into the dynamic script directly, just like other things, if you prefer EXECUTE () over EXECUTE sp_executesql.
If you like, you can try this query at SQL Fiddle. (Note: the SQL Fiddle version replaces the @TQueries table variable with the TQueries table.)
This is what I've managed to piece together, cobbled from my original response and improved upon by comments from @AndriyM.
DECLARE @sql_prefix NVARCHAR(MAX);
SET @sql_prefix =
N'DECLARE @TResults TABLE (
    [Ordinal] INT IDENTITY(1,1),
    [ContentItemId] INT
);
DECLARE @max INT, @top INT;
SELECT @max = 10;';
DECLARE @sql_insert_template NVARCHAR(MAX), @sql_body NVARCHAR(MAX);
SET @sql_insert_template =
N'SELECT @top = @max - COUNT(*) FROM @TResults;
INSERT INTO @TResults
SELECT TOP (@top) [Id]
FROM [dbo].[SourceTable]
WHERE
    {WherePredicate}
    AND NOT EXISTS (
        SELECT 1
        FROM @TResults AS [tr]
        WHERE [tr].[ContentItemId] = [SourceTable].[Id]
    )
ORDER BY {OrderByPredicate};';
WITH Query ([Ordinal],[SqlCommand]) AS (
    SELECT
        [Ordinal],
        REPLACE(REPLACE(@sql_insert_template, '{WherePredicate}', [WherePredicate]), '{OrderByPredicate}', [OrderByPredicate])
    FROM @TQueries
)
SELECT
    @sql_body = @sql_prefix + (
        SELECT [SqlCommand]
        FROM Query
        ORDER BY [Ordinal] ASC
        FOR XML PATH(''),TYPE).value('.', 'varchar(max)') + CHAR(13)+CHAR(10)
    + N' SELECT * FROM @TResults ORDER BY [Ordinal]';
EXEC(@sql_body);
The basic idea is to use a table variable to hold the results of each query. I create a template for the SQL and replace the values in the template based on what is stored in @TQueries.
Once the entire script is completed I run it with EXEC.
