Selecting rows partially matching rows in another table - sql-server

I have a table for the actions.
The table has several slots for the same time on the same day, and the same action can't be booked for the same time twice. I'm trying to come up with a way to list all the IDs for an action 'A' such that every available time is listed only once, even if both slots are available; but if 'A' is already booked for some time and the other slot for that time is empty, that slot shouldn't show up.
It turns out I don't know T-SQL that well.
I worked around this by selecting all the rows where 'A' is booked, selecting all distinct (date, time start, time end) tuples that are not booked, and then checking whether 'A' is already booked for each time. But all this checking is done at the application level, and those multiple round trips to the server, plus looping in the program to do the job of one LIKELY SIMPLE SQL request, don't look very efficient to me.
Is there a way to do something like:
SELECT ID FROM mytable
WHERE Action IS NULL AND (date, time_start, time_end **'ALL TOGETHER IN ONE ROW'**)
NOT IN (SELECT date, time_start, time_end FROM mytable
WHERE Action = 'A')
HAVING 'THOSE THREE BEING DISTINCT'
In other words, can I select rows which partially match other rows? It would be simple if I had only one column to compare, but there are three.
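The closest shape I can imagine is a correlated NOT EXISTS comparing all three columns at once, though I don't know whether this is the right approach (a rough sketch against my table):
SELECT MIN(t.ID) AS ID, t.[date], t.time_start, t.time_end
FROM mytable t
WHERE t.[Action] IS NULL
-- the three columns compared 'all together in one row'
AND NOT EXISTS (SELECT 1 FROM mytable b
                WHERE b.[Action] = 'A'
                AND b.[date] = t.[date]
                AND b.time_start = t.time_start
                AND b.time_end = t.time_end)
-- 'those three being distinct': one row per available time
GROUP BY t.[date], t.time_start, t.time_end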

In SQL Server we generally use WHILE instead of FOR. I believe what you're trying to do could be accomplished as follows if you want to loop through the table (ideally your ID field would be the PRIMARY KEY as well). This just inserts into a temp table for now, but it should give you the results you want:
-- DECLARE and set counters
DECLARE @curr INT, @prev INT, @max INT
SELECT @curr = 0, @prev = 0, @max = MAX(ID) FROM myTable
-- Make a simple temp table
CREATE TABLE #temp (ID INT)
-- Start looping
WHILE (@curr < @max)
BEGIN
-- Set our counter for the next row
SELECT @curr = MIN(ID) FROM myTable WHERE ID > @prev
-- Populate temp table with a self-join to compare slots
-- Slot must match on date + time but NOT have equal SLOT value
-- Will only INSERT if we meet our criteria i.e. neither slot booked
INSERT INTO #temp
SELECT DISTINCT A.ID
FROM myTable A
JOIN myTable B ON B.[Date] = A.[Date] AND B.time_start = A.time_start AND B.time_end = A.time_end
WHERE A.[Action] IS NULL -- Indicates NO booking
AND B.[Action] IS NULL -- Indicates NO booking
AND A.SLOT <> B.SLOT
AND A.ID = @curr
-- Update our counter
SET @prev = @curr
END
-- Get all our records
SELECT * FROM #temp
-- Remove the sleeping dog ;)
DROP TABLE #temp
There is a little bit of redundancy here because it checks ALL rows, even if a condition has been found in the first row of that time slot, but you can tweak it from here if you need to.
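For comparison, the same self-join can be run once over the whole table without the loop (a sketch using the same assumed myTable columns):
-- Set-based equivalent of the loop above: identical join and filters,
-- applied to every row in one statement instead of one ID at a time
SELECT DISTINCT A.ID
FROM myTable A
JOIN myTable B ON B.[Date] = A.[Date] AND B.time_start = A.time_start AND B.time_end = A.time_end
WHERE A.[Action] IS NULL
AND B.[Action] IS NULL
AND A.SLOT <> B.SLOT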
You should really avoid using field names like "Date" and "Action" because these are reserved words in SQL.

Your question is a bit unclear, but I think this will point you in a productive direction. SQL is designed to perform operations on sets of rows, not to loop through processing one row at a time. The following code will correlate your data into one row for each pair of slots at each date/time. You can use a CASE expression, as shown, to add a column that indicates the status of the row, and you can then add a WHERE clause, sketched below, to perform any additional filtering.
-- Sample data.
declare @Samples as Table ( SampleId Int, Slot Int, EventDate Date, StartTime Time(0), EndTime Time(0), Action VarChar(10) );
insert into @Samples ( SampleId, Slot, EventDate, StartTime, EndTime, Action ) values
( 200, 1, '20150501', '00:00:00', '00:30:00', NULL ),
( 201, 2, '20150501', '00:00:00', '00:30:00', NULL ),
( 202, 1, '20150501', '00:30:00', '01:00:00', 'A' ),
( 203, 2, '20150501', '00:30:00', '01:00:00', NULL ),
( 204, 1, '20150501', '01:00:00', '01:30:00', NULL ),
( 205, 2, '20150501', '01:00:00', '01:30:00', 'A' ),
( 206, 1, '20150501', '01:30:00', '02:00:00', 'B' ),
( 207, 2, '20150501', '01:30:00', '02:00:00', 'B' );
select * from @Samples;
-- Data correlated for each date/time.
select Slot1.EventDate, Slot1.StartTime, Slot1.EndTime,
Slot1.Action as Action1, Slot2.Action as Action2,
Coalesce( Slot1.Action, Slot2.Action ) as SummaryAction,
case when Slot1.Action = Slot2.Action then 'ERROR!' else 'Okay.' end as Status
from @Samples as Slot1 inner join
@Samples as Slot2 on Slot2.EventDate = Slot1.EventDate and Slot2.StartTime = Slot1.StartTime and
Slot1.Slot = 1 and Slot2.Slot = 2;
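For example, a WHERE clause along these lines (my assumption of what "available for 'A'" means, based on the question) keeps only the date/times where 'A' holds neither slot and at least one slot is still free:
-- Appended to the correlated query above (a sketch): times where action 'A'
-- can still be booked, shown once per date/time pair
select Slot1.EventDate, Slot1.StartTime, Slot1.EndTime
from @Samples as Slot1 inner join
@Samples as Slot2 on Slot2.EventDate = Slot1.EventDate and Slot2.StartTime = Slot1.StartTime and
Slot1.Slot = 1 and Slot2.Slot = 2
where Coalesce( Slot1.Action, '' ) <> 'A' -- 'A' holds neither slot
and Coalesce( Slot2.Action, '' ) <> 'A'
and ( Slot1.Action is null or Slot2.Action is null ); -- at least one slot free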

Related

(SOLVED) - First iteration of WHILE loop runs out of memory despite manual reconstruction of query succeeding

Environment: SQL Server 2019 (v15).
I have a large query that uses too much space when run as a single SELECT statement. When I try to run it, I get the following error:
Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'.
However, the problem breaks down naturally into a dozen or so pieces, so I wrote a WHILE loop to iterate through each piece and insert into a results table. Unfortunately, the first iteration of the WHILE loop fails with the same error. All the WHILE loop is doing is changing a few values in the WHERE clause.
The key thing confusing me here is that when I manually run one iteration of the INSERT statement, absent all looping logic, it works perfectly.
Manually coding the first iteration to use the first institution_name just works, so I don't think the joins here are going wrong and causing the memory error.
WITH my_cte AS
(
SELECT [columns]
FROM mytable a
INNER JOIN bigtable b ON a.institution_name = b.institution_name
AND a.personID = b.personID
WHERE a.institution_name = 'ABC'
AND b.institution_name = 'ABC'
)
INSERT INTO results (personID, institution_name, ...)
SELECT personID, institution_name, [some aggregations]
FROM my_cte
GROUP BY personID, institution_name;
The version with the WHILE loop fails. I need to run the query with different values for institution_name.
Here I show three different values but even just the first iteration fails.
DECLARE @INSTITUTION varchar(10)
DECLARE @COUNTER int
SET @COUNTER = 0
DECLARE @LOOKUP table (temp_val varchar(10), temp_id int)
INSERT INTO @LOOKUP (temp_val, temp_id)
VALUES ('ABC', 1), ('DEF', 2), ('GHI', 3)
WHILE @COUNTER < 3
BEGIN
SET @COUNTER = @COUNTER + 1
SELECT @INSTITUTION = temp_val
FROM @LOOKUP
WHERE temp_id = @COUNTER;
WITH my_cte AS
(
SELECT [columns]
FROM mytable a
INNER JOIN bigtable b ON a.institution_name = b.institution_name
AND a.personID = b.personID
WHERE a.institution_name = @INSTITUTION
AND b.institution_name = @INSTITUTION
)
INSERT INTO results (personID, institution_name, ...)
SELECT personID, institution_name, [some aggregations]
FROM my_cte
GROUP BY personID, institution_name
END
As I write this question, I have quite literally just copy-pasted the INSERT statement a dozen times, changed the relevant WHERE clause in each copy, and run it without errors. Could it be some kind of datatype issue, where the query can properly subset when a string literal is put in the WHERE clause, but the lookup on my table variable fails because of the datatype? I notice that mytable.institution_name is varchar(10) while bigtable.institution_name is nvarchar(10). Setting the lookup table to use nvarchar(10) didn't fix it either.
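One plausible explanation (a guess, since the accepted fix isn't shown here): with the literal 'ABC' the optimizer can consult the statistics histogram for that exact value, but with a variable it cannot sniff the value at compile time, so it estimates from average density and may choose a very different plan, one that spills far more to tempdb. A minimal sketch of the usual workaround is to add OPTION (RECOMPILE) so each iteration compiles with the variable's current value:
WITH my_cte AS
(
SELECT [columns]
FROM mytable a
INNER JOIN bigtable b ON a.institution_name = b.institution_name
AND a.personID = b.personID
WHERE a.institution_name = @INSTITUTION
AND b.institution_name = @INSTITUTION
)
INSERT INTO results (personID, institution_name, ...)
SELECT personID, institution_name, [some aggregations]
FROM my_cte
GROUP BY personID, institution_name
OPTION (RECOMPILE) -- plan is built using @INSTITUTION's actual value each iteration
The varchar/nvarchar mismatch the poster mentions can also hurt: comparing a varchar column to an nvarchar value forces an implicit conversion that can prevent an index seek, so aligning the types is worth doing regardless.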

Bulk create rows with a foreign key dependency?

I wrote some SQL statements that work for updating a single customer. I have to update all the customers when this code gets pushed out.
Right now the customer ID is hardcoded and the SQL statements insert one record based on that ID. Prototype works, now I want to do like 10,000 inserts for all of the customers using the same algorithm.
DECLARE @customerID BIGINT = 47636;
DECLARE @limitFourAdjustment MONEY;
DECLARE @appliesToDateTime DATETIME2(7) = SYSUTCDATETIME();
DECLARE @dp_y INT = DATEPART(YEAR, @appliesToDateTime);
DECLARE @dp_m INT = DATEPART(MONTH, @appliesToDateTime);
DECLARE @dp_w INT = DATEPART(WEEK, @appliesToDateTime);
DECLARE @dp_d INT = DATEPART(DAY, @appliesToDateTime);
DECLARE @dp_h INT = DATEPART(HOUR, @appliesToDateTime);
DECLARE @d_h DATETIME2(7) = DATEADD(HOUR, DATEDIFF(HOUR, 0, @appliesToDateTime), 0);
SELECT
@limitFourAdjustment = -COALESCE(SUM(COALESCE(Amount, 0)), 0)
FROM
[dbo].Transactions
WHERE
CustomerID = @customerID AND
IsSystemVoid = 0 AND
TransactionTypeID IN (SELECT ID FROM TransactionTypes WHERE TransactionTypeGroupID = 3)
INSERT INTO dbo.CustomerAccounts_TransactionSummation (CustomerID, LimitTypeID, Y, M, W, D, H, YMDH, Amount)
VALUES (@customerID, 4, @dp_y, @dp_m, @dp_w, @dp_d, @dp_h, @d_h, @limitFourAdjustment);
I tried adding a WHILE loop, but that doesn't seem like the fastest solution. Maybe collect the IDs first and then feed them through the loop? My first attempt below doesn't work because I just get the last customer ID every time, not a unique one.
SELECT @numberOfCustomers = COUNT(*)
FROM dbo.Customers
WHILE(@numberOfCustomers > 0)
BEGIN
SELECT @customerID = ID FROM dbo.Customers
-- OTHER LOGIC FROM ABOVE
SET @numberOfCustomers = @numberOfCustomers - 1;
END
So the question is, how to run these SQL statements (first code block) on every customer's ID?
The key to working with databases is getting your mind around set based operations as opposed to procedural operations. Databases are designed to operate naturally on sets of data at a time, but you have to change how you think about the problem to one where you are manipulating the entire set of data as opposed to one record at a time.
So here is the SQL which I think carries out your complete update in one hit:
INSERT INTO dbo.CustomerAccounts_TransactionSummation (CustomerID, LimitTypeID, Y, M, W, D, H, YMDH, Amount)
SELECT
CustomerID
, 4
, @dp_y
, @dp_m
, @dp_w
, @dp_d
, @dp_h
, @d_h
, -COALESCE(SUM(COALESCE(Amount, 0)), 0) limitFourAdjustment
FROM [dbo].Transactions
WHERE IsSystemVoid = 0
and TransactionTypeID IN (SELECT ID FROM TransactionTypes WHERE TransactionTypeGroupID = 3)
--and CustomerID = @customerID
GROUP BY CustomerID -- one summation row per customer (required for the SUM)
Note that the insert can be combined directly with a select as opposed to using values.
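One thing to watch (a sketch, under the assumption that dbo.Customers.ID is the customer key): the query above only emits rows for customers who have matching transactions, whereas the original loop inserted a zero row for every customer. Driving the query from Customers with a LEFT JOIN restores that behaviour:
INSERT INTO dbo.CustomerAccounts_TransactionSummation (CustomerID, LimitTypeID, Y, M, W, D, H, YMDH, Amount)
SELECT
c.ID
, 4
, @dp_y
, @dp_m
, @dp_w
, @dp_d
, @dp_h
, @d_h
, -COALESCE(SUM(COALESCE(t.Amount, 0)), 0) -- stays 0 when no transactions match
FROM dbo.Customers c
LEFT JOIN dbo.Transactions t
ON t.CustomerID = c.ID
AND t.IsSystemVoid = 0
AND t.TransactionTypeID IN (SELECT ID FROM TransactionTypes WHERE TransactionTypeGroupID = 3)
GROUP BY c.ID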

How to design bar chart on SSRS

I want to create a report in SSRS like the one in the picture below.
The yellow parts are SET_PHASE,
the green parts are PROD_PHASE.
And my query result looks like this:
For each line I want to show all orders, and for each order I want to show SETUP and PRODUCTION sized by their duration.
SET_PHASE's duration is SET_DURATION,
PROD_PHASE's duration is PROD_DURATION.
I hope my question is clear :) Could you help me with this issue?
Hello Alan,
At the moment I have just this data:
PROD100059335 SETUP PRODUCTION 1 14 LINE 4
PROD100058991 SETUP PRODUCTION 1 5 LINE 6
PROD100059259 SETUP PRODUCTION 2 24 LINE 4
PROD100059188 SETUP PRODUCTION 1 3 LINE 2
PROD100059248 SETUP PRODUCTION 1 15 LINE 2
PROD100059055 SETUP PRODUCTION 2 23 LINE 2
PROD100058754 SETUP PRODUCTION 5 18 LINE 6
And if I use your query I only see the orders "PROD100058754", "PROD100059259" and "PROD100059055". I don't understand why the other data is lost.
Up until the "DECLARE @n TABLE(n int)" part I can see the other data, but after that I cannot.
And when I applied the procedure in SSRS, my report looks like this:
I couldn't get it right and I don't know how to fix it :(
For example, the order "PROD100059259" normally has a setup phase, but on the report I don't get a yellow field for it.
Do you have any suggestions for me?
OK, here is an attempt to give you what you want, but there are a few caveats:
The durations are scaled and no operation can take less than 1 time slot, so the setup vs production duration is only approximate.
I haven't found a good way of labelling each bar so I've used tooltips.
First the code... I've added lots of comments so hopefully you can follow it through; it's based on your sample data.
NOTE: I've updated the table, as it now seems you are using integer durations rather than the 00:00 format from your first example.
-- CREATE A TEST TABLE AND POPULATE IT
DECLARE @data TABLE(STR_ORDER_ID varchar(20), SET_DURATION varchar(10), PROD_DURATION varchar(10), Set_decimal int, Prod_Decimal int, Line varchar(10))
INSERT INTO @data
VALUES
('PROD100059335', NULL, NULL, 1, 14, 'LINE 4'),
('PROD100058991', NULL, NULL, 1, 5, 'LINE 6'),
('PROD100059259', NULL, NULL, 2, 24, 'LINE 4'),
('PROD100059188', NULL, NULL, 1, 3, 'LINE 2'),
('PROD100059248', NULL, NULL, 1, 15, 'LINE 2'),
('PROD100059055', NULL, NULL, 2, 23, 'LINE 2'),
('PROD100058754', NULL, NULL, 5, 18, 'LINE 6')
DECLARE @Gap int = 2 -- determines how many columns we use to separate each order
-- ASSUME durations are in hours/minutes or minutes/seconds and convert them to decimal minutes or decimal seconds respectively
-- COMMENTED OUT: no longer required as durations are now integer values
--UPDATE d
-- SET
-- Set_decimal = (CAST(LEFT(d.SET_DURATION, len(d.SET_DURATION)-3) AS INT) * 60) + CAST(RIGHT(d.SET_DURATION, 2) AS INT) ,
-- Prod_Decimal = (CAST(LEFT(d.PROD_DURATION, len(d.PROD_DURATION)-3) AS INT) * 60) + CAST(RIGHT(d.PROD_DURATION, 2) AS INT)
--FROM @data d
-- CREATE A NORMALISED TABLE, this will just help to make the next steps simpler
DECLARE @normData TABLE(RowId INT IDENTITY (1,1), Line varchar(10), STR_ORDER_ID varchar(20), OperationOrder int, Operation varchar(10), Duration int)
INSERT INTO @normData (Line, STR_ORDER_ID, OperationOrder, Operation, Duration)
SELECT * FROM (
SELECT Line, STR_ORDER_ID, 1 as OperationOrder , 'SET' as Operation , Set_decimal FROM @data
UNION
SELECT Line, STR_ORDER_ID, 2 , 'PROD' , Prod_decimal FROM @data
UNION
SELECT Line, STR_ORDER_ID, 3 , 'GAP' , @Gap FROM @data ) u -- adds dummy data that will act as gaps in the timeline; change @Gap to whatever value suits you best
ORDER BY Line, STR_ORDER_ID, OperationOrder
-- Find the largest running total duration per line and scale it to fit 240 (so we don't go over the 256 column limit in SSRS)
DECLARE @MaxDur INT = (SELECT MAX(rt) FROM (
select *
, SUM(Duration) OVER(PARTITION BY Line ORDER BY Line, STR_ORDER_ID, OperationOrder) AS Rt
from @normData) mRt)
-- Now scale the values back so they fit but don't let any value become less than 1
IF @MaxDur > 240
BEGIN
UPDATE nd
SET Duration = CASE WHEN nd.Duration / (@MaxDur/240) < 1 THEN 1 ELSE nd.Duration / (@MaxDur/240) END
FROM @normData nd
END
/* check what we have so far by uncommenting this bit
select *
, SUM(Duration) OVER(PARTITION BY Line ORDER BY Line, STR_ORDER_ID, OperationOrder) AS Rt
from @normData
--*/
-- ================================================================ --
-- At this point you 'may' have enough data to plot a bar chart. == --
-- ================================================================ --
-- CREATE A SIMPLE NUMBERS TABLE, we'll need this to act as our time series
DECLARE @n TABLE(n int)
DECLARE @i int = 0
DECLARE @t int = @MaxDur --(SELECT max(Duration) +5 FROM @normData) -- simple loop counter target
WHILE @i < @t
BEGIN
INSERT INTO @n SELECT @i
SET @i = @i + 1
END
-- Join our numbers table to our real data
-- This will give us at least 1 row per time slot and associated activity during that time slot.
-- We can plot this directly as a matrix.
SELECT *
FROM @n n
LEFT JOIN (
-- The subqueries below give us a running total; we then join this back to itself to get the previous
-- running total, and this gives us the 'time range' for each operation.
SELECT
a.*
, ISNULL(b.Rt,0)+1 AS TimeStart
, a.Rt AS TimeEnd
FROM
(SELECT *
, SUM(Duration) OVER(PARTITION BY Line ORDER BY Line, STR_ORDER_ID, OperationOrder) AS Rt
from @normData
) a
LEFT JOIN
(SELECT *
, SUM(Duration) OVER(PARTITION BY Line ORDER BY Line, STR_ORDER_ID, OperationOrder) AS Rt
from @normData
) b
ON a.RowId = b.RowId + 1 and a.Line = b.Line
) d
ON n.n between d.TimeStart and d.TimeEnd
ORDER BY Line, STR_ORDER_ID, OperationOrder, n, TimeStart, TimeEnd
You can use the code above in your dataset.
The report design:
The report is very simple. It's a matrix with a single row group based on Line and a single column group based on n which is our time slot number.
I've added a blank row to act as a spacer between the 'bars'.
The expression of the cell background is
=SWITCH(
Fields!OperationOrder.Value = 1, "Yellow",
Fields!OperationOrder.Value = 2, "Green",
Fields!OperationOrder.Value = 3, Nothing,
True, Nothing
)
There is also a tooltip which displays STR_ORDER_ID and the operation name.
You get the following output.

Microsoft SQL Server: trigger to check that the sum of a field grouped by another field does not exceed a particular value?

I have a table where a person can log a number of hours on a day:
person | day | hours
--------------------
   1   |  1  |   4
   2   |  1  |   2
...
I want to create a trigger that doesn't allow the sum of hours to be greater than a specific value, for example 24, for a single person on a specific day.
Multiple rows can be inserted on multiple days simultaneously, and the trigger should then check that each day still has a valid number of hours for each person.
I have tried reading the documentation and similar questions here, but haven't been able to solve this, and have very little experience with SQL Server.
Any help would be appreciated!
You could use a user-defined function for this, then create a check constraint that calls it, something like this...
User-Defined Function
CREATE FUNCTION dbo.get_TotalHoursRemaining (
@PersonID INT
, @Day INT
)
RETURNS INT
AS
BEGIN
DECLARE @Hours INT
SELECT @Hours = ISNULL(SUM([HOURS]), 0)
FROM Test_Table
WHERE Person = @PersonID
AND [DAY] = @Day
GROUP BY Person , [DAY]
SET @Hours = 24 - @Hours;
RETURN @Hours;
END
Constraint
ALTER TABLE Test_Table
ADD CONSTRAINT chk_hours_remaining
CHECK (((dbo.get_TotalHoursRemaining(Person , [Day])) >= 0))
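As a quick sanity check (a sketch, assuming Test_Table has Person, [Day] and [Hours] columns as the function implies), the constraint is evaluated with the new row already in place, so the second insert below should fail:
INSERT INTO Test_Table (Person, [Day], [Hours]) VALUES (1, 1, 20); -- 4 hours remaining, passes
INSERT INTO Test_Table (Person, [Day], [Hours]) VALUES (1, 1, 5);  -- total would be 25, violates chk_hours_remaining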
You are looking for a constraint; a trigger is overkill in this case.
This is based on the assumption that hours are logged only once per day against each id.
create table dbo.test
(
id int,
hrs int
);
ALTER TABLE dbo.test
ADD CONSTRAINT CHK_hrs CHECK (hrs <= 24);
You can also add a constraint to restrict the minimum hours:
ALTER TABLE dbo.test
ADD CONSTRAINT CHK_hrs CHECK (hrs > 0 AND hrs <= 24);
insert into dbo.test
select 1,25
You can also have a trigger if each person logs hours multiple times a day and you want the per-day sum capped at 24:
create table dbo.test
(
id int,
hrs int,
dayy datetime
);
create trigger trg_test
on dbo.test
after insert
as
begin
    -- check each (id, day) touched by this insert against the 24 hour cap
    if exists(select t.id, t.dayy
              from dbo.test t
              join (select distinct id, dayy from inserted) i on i.id = t.id and i.dayy = t.dayy
              group by t.id, t.dayy
              having sum(t.hrs) > 24)
    begin
        -- sum of hours per day exceeded... you could also record the offending totals in a log table
        insert into logtable
        select t.id, sum(t.hrs)
        from dbo.test t
        join (select distinct id, dayy from inserted) i on i.id = t.id and i.dayy = t.dayy
        group by t.id, t.dayy
        having sum(t.hrs) > 24

        rollback
    end
end
If you create a check constraint on that column, you can prevent any single value greater than 24 from being inserted. Basically what that does is: when you run an insert statement and the hours value violates the rule, the constraint kills the transaction immediately. (Note that a plain column check constraint validates each row's value on its own; it cannot see the sum across rows.)
Check this link to get you started: http://www.w3schools.com/sql/sql_check.asp
Hope this helps :)

With SQL Server, How can I query a table based on a delimited string as the criteria?

I have the following tables:
tbl_File:
FileID | Filename
-----------------
1 | test.jpg
and
tbl_Tag:
TagID | TagName
---------------
1 | Red
and
tbl_TagFile:
ID | TagID | FileID
-------------------
1 | 1 | 1
I need to pass a non-inclusive query against these tables. For example, imagine a list of checkboxes to select one or more tags, and then a search button. I need to pass the TagIDs to the query as a pipe-delimited string, such as "1|2|5|".
The search results need to be non-inclusive, meaning a file must meet all the criteria: if 3 tags are selected, the results are to be files that have all 3 tags associated with them.
I think I've made this too complicated; I tried iterating over the tags using CHARINDEX and the like to work my way through the string, but it seems there must be an easier way.
I'd like to do this as a function... Such as
SELECT FileID, Filename
FROM tbl_Files
WHERE dbo.udf_FileExistswithTags(@Tags, FileID) = 1
Any efficient way to do this?
It doesn't sound from your example scenario that the actual "need" is to pass a pipe-delimited string. I would highly suggest abandoning that idea and using a Table Value Parameter in your stored procedure. This has numerous advantages in that you will not hit a datatype limit or a "number of parameters" limit that might occur with very large sets of criteria. Additionally it gets away from any need to run a (potentially very slow) UDF.
Split the string into tokens on the application side, and then insert each token as a row in the TVP. Example below:
Create the TVP type in your database:
CREATE TYPE [dbo].[FileNameType] AS TABLE
(
fileName varchar(1000)
)
On the application side, build your list of filename tokens into a recordset:
private static List<SqlDataRecord> BuildFileNameTokenRecords(IEnumerable<string> tokens)
{
    var records = new List<SqlDataRecord>();
    foreach (string token in tokens)
    {
        var record = new SqlDataRecord(
            new SqlMetaData[]
            {
                // length must match the TVP column definition
                new SqlMetaData("fileName", SqlDbType.VarChar, 1000),
            }
        );
        record.SetString(0, token); // without this the row would be sent empty
        records.Add(record);
    }
    return records;
}
Wherever you run your proc from (rough code here):
var records = BuildFileNameTokenRecords(listofstrings);
var sqlCmd = sqlDb.GetStoredProcCommand("FileExists");
sqlDb.AddInParameter(sqlCmd, "tvpFilenameTokens", SqlDbType.Structured, records);
ExecuteNonQuery(sqlCmd);
Filtering your select statement then simply becomes a matter of joining on the tokens in the table parameter. Something like this:
CREATE PROCEDURE dbo.FileExists
(
-- Put additional parameters here
@tvpFilenameTokens dbo.FileNameType READONLY
)
AS
BEGIN
SELECT FileID, Filename
FROM tbl_Files INNER JOIN @tvpFilenameTokens tokens
ON tbl_Files.FileID = tokens.fileName
END
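Side note: to try the proc without the application layer, you can populate the TVP directly in T-SQL (a sketch using the type and proc names above):
DECLARE @tokens dbo.FileNameType;
INSERT INTO @tokens (fileName) VALUES ('1'), ('2'), ('5'); -- the already-split tokens
EXEC dbo.FileExists @tvpFilenameTokens = @tokens;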
Here is an option that should scale. All of the functionality is available back to SQL Server 2005. It uses a CTE to separate the portion of the query that finds only the FileIDs that have all of the TagIDs passed in, and then that list of FileIDs is joined to the [File] table to get the details. It also uses an INNER JOIN instead of an IN list to match the TagIDs.
Please note that the example below uses a SQLCLR splitter that is freely available in the SQL# library (which I wrote, but this function is in the Free version). The specific splitter used is not the important part; it should just be one that is either SQLCLR, an inline tally-table (like the one used in @wewesthemenace's answer), or the XML method. Just don't use a splitter based on a WHILE loop or a recursive CTE.
---- TEST SETUP
DECLARE @File TABLE
(
FileID INT NOT NULL PRIMARY KEY,
[Filename] NVARCHAR(200) NOT NULL
);
DECLARE @TagFile TABLE
(
TagID INT NOT NULL,
FileID INT NOT NULL,
PRIMARY KEY (TagID, FileID)
);
INSERT INTO @File VALUES (1, 'File1.txt');
INSERT INTO @File VALUES (2, 'File2.txt');
INSERT INTO @File VALUES (3, 'File3.txt');
INSERT INTO @TagFile VALUES (1, 1);
INSERT INTO @TagFile VALUES (2, 1);
INSERT INTO @TagFile VALUES (5, 1);
INSERT INTO @TagFile VALUES (1, 2);
INSERT INTO @TagFile VALUES (2, 2);
INSERT INTO @TagFile VALUES (4, 2);
INSERT INTO @TagFile VALUES (1, 3);
INSERT INTO @TagFile VALUES (2, 3);
INSERT INTO @TagFile VALUES (5, 3);
INSERT INTO @TagFile VALUES (6, 3);
---- DONE WITH TEST SETUP
DECLARE @TagsToGet VARCHAR(100); -- this would be the proc input parameter
SET @TagsToGet = '1|2|5';
CREATE TABLE #Tags (TagID INT NOT NULL PRIMARY KEY);
DECLARE @NumTags INT;
INSERT INTO #Tags (TagID)
SELECT split.SplitVal
FROM SQL#.String_Split4k(@TagsToGet, '|', 1) split;
SET @NumTags = @@ROWCOUNT;
;WITH files AS
(
SELECT tf.FileID
FROM @TagFile tf
INNER JOIN #Tags tg
ON tg.TagID = tf.TagID
GROUP BY tf.FileID
HAVING COUNT(*) = @NumTags
)
SELECT fl.*
FROM @File fl
INNER JOIN files
ON files.FileID = fl.FileID
ORDER BY fl.[Filename] ASC;
DROP TABLE #Tags; -- don't need this if code above is placed in a proc
Results:
FileID Filename
1 File1.txt
3 File3.txt
Notes
As much as I love TVPs (and I do, when they are done correctly and used appropriately), I would say that they are a bit much for this type of small scale, single dimensional array scenario. There won't really be any performance gain over using a SQLCLR streaming TVF string splitter but it would require more app code and the additional User-Defined Table Type, which can't be updated without first dropping all procs that reference it. That doesn't happen all of the time, but needs to be considered in terms of long-term maintenance costs.
The JOIN between TagFile and the temporary table populated from the split operation should be much more efficient than using an IN list with a subquery for the split operation. An IN list is short-hand for all of the values in it to be their own OR conditions. Hence the JOIN is a fully set-based approach that lets the Query Optimizer do its thang.
The structure I used for the test @TagFile table only has the two relevant IDs in it: TagID and FileID. It does not have the ID field that I assume is an IDENTITY field on this table. Unless there is a very specific reason for needing that IDENTITY field, I would suggest removing it. It adds no inherent benefit, as the combination of TagID and FileID is a natural key (i.e. it is both NOT NULL and unique). And if the clustered PK of this table were simply those two fields, the JOIN to the temp table of split-out TagIDs would be quite fast, even with millions of rows in TagFile.
One reason that this approach works so much better than trying to handle this via a function per FileID (outside of the obvious set-based is better than cursor-based reason) is that the list of TagIDs is the same for all files to be checked. So splitting that out more than one time is a waste of effort.
By not splitting the TagID list inline in the query I am able to capture the number of elements in that list with no additional effort. Hence this saves from needing to do a secondary calculation.
Here is a function called DelimitedSplit8K by Jeff Moden. This is used to split strings of length up to 8000. For more info, read this: http://www.sqlservercentral.com/articles/Tally+Table/72993/
CREATE FUNCTION [dbo].[DelimitedSplit8K](
@pString VARCHAR(8000), --WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
@pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (--10E+1 or 10 rows
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString, t.N, 1) = @pDelimiter
),
cteLen(N1, L1) AS(--==== Return start and length (for use in substring)
SELECT
s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, s.N1), 0) - s.N1, 8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
Your query would now be:
DECLARE @pString VARCHAR(8000) = '1|3|5'
SELECT
f.*
FROM tbl_File f
INNER JOIN tbl_TagFile tf ON tf.FileID = f.FileID
WHERE
tf.TagID IN(SELECT CAST(item AS INT) FROM dbo.DelimitedSplit8K(@pString, '|'))
GROUP BY f.FileID, f.FileName
HAVING COUNT(tf.ID) = (LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
The expression below counts the number of TagIDs in the parameter by counting the occurrences of the delimiter | and adding 1; for '1|3|5' that is 5 - 3 + 1 = 3 tags.
(LEN(@pString) - LEN(REPLACE(@pString,'|','')) + 1)
Here is an option that does not require UDF's.
It can be argued that this is also complicated.
DECLARE @TagList VARCHAR(50)
-- pass in this
SET @TagList = '1|3|6'
SELECT
FinalSet.FileID,
FinalSet.Tag,
FinalSet.TotalMatches
FROM
(
SELECT
tbl_TagFile.FileID,
tbl_TagFile.Tag,
COUNT(*) OVER(PARTITION BY tbl_TagFile.FileID) TotalMatches
FROM
(
SELECT 1 FileID, '1' Tag UNION ALL
SELECT 1 , '2' UNION ALL
SELECT 1 , '3' UNION ALL
SELECT 1 , '6' UNION ALL
SELECT 2 , '1' UNION ALL
SELECT 2 , '3'
) tbl_TagFile
INNER JOIN
(
SELECT tbl_Tag.Tag
FROM
(
SELECT '1' Tag UNION ALL
SELECT '2' UNION ALL
SELECT '3' UNION ALL
SELECT '4' UNION ALL
SELECT '5' UNION ALL
SELECT '6'
) tbl_Tag
WHERE '|' + @TagList + '|' LIKE '%|' + Tag + '|%'
) LimitedTagTable
ON LimitedTagTable.Tag = tbl_TagFile.Tag
) FinalSet
WHERE
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
There are some complications in this around data types and indexes and so on, but you can see the concept: you are only getting the records that match your passed-in string.
The subquery LimitedTagTable is your tag list filtered by your input pipe-delimited string.
The subquery FinalSet joins your limited tag list to your list of files.
The column TotalMatches works out how many tag matches each file had.
Finally, this line limits the output to those files that had enough matches:
FinalSet.TotalMatches = (LEN(@TagList) - LEN(REPLACE(@TagList,'|','')) + 1)
Please experiment with different inputs and datasets and see if it suits as I have made a number of assumptions.
I'm answering my own question, in hopes that someone can let me know if (and how) it is flawed. So far it seems to be working, but this is only early testing.
Function:
ALTER FUNCTION [dbo].[udf_FileExistsByTags]
(
@FileID int
,@Tags nvarchar(max)
)
RETURNS bit
AS
BEGIN
DECLARE @Exists bit = 0
DECLARE @Count int = 0
DECLARE @TagTable TABLE ( FileID int, TagID int )
DECLARE @Tag int
WHILE len(@Tags) > 0
BEGIN
SET @Tag = CAST(LEFT(@Tags, charindex('|', @Tags + '|') - 1) as int)
SET @Count = @Count + 1
IF EXISTS (SELECT * FROM tbl_FileTag WHERE FileID = @FileID AND TagID = @Tag )
BEGIN
INSERT INTO @TagTable ( FileID, TagID ) VALUES ( @FileID, @Tag )
END
SET @Tags = STUFF(@Tags, 1, charindex('|', @Tags + '|'), '')
END
SET @Exists = CASE WHEN @Count = (SELECT COUNT(*) FROM @TagTable) THEN 1 ELSE 0 END
RETURN @Exists
END
Then in the query:
SELECT * FROM tbl_File a WHERE dbo.udf_FileExistsByTags(a.FileID, @Tags) = 1
So now I'm looking for errors.
What do you think? Probably not very efficient, but this search will only be run on a periodic basis.
