SQL join into a Recursive CTE with parameters - sql-server

I am trying to use a SQL query to create some client-side reporting for my company. There are three tables I would like to join together. One of them may require a CTE, as I need to recursively walk a table and return a row. Here is how the tables are structured (simplified).
I want an output table that, for each WorkOrder, displays the most recently completed task in DataCollection (including its completion time) and the next Op in TaskListing. I figured a CTE may be the only way to recursively go through each row and determine which task comes next (by checking whether the completed Op appears in the PreOp column). If the completed cell doesn't appear as a PreOp anywhere, the result should default to MAX(Op) (the last task).
CREATE TABLE [dbo].[WorkOrder](
[WorkOrderID][int] NOT NULL PRIMARY KEY,
[Column1] [nvarchar](20),
[Column2] [nvarchar](20)
)
INSERT INTO WorkOrder VALUES(1,'x','y');
INSERT INTO WorkOrder VALUES(2,'x','y');
INSERT INTO WorkOrder VALUES(3,'x2','y2');
CREATE TABLE [dbo].[DataCollection](
[DataCollection][int] NOT NULL PRIMARY KEY,
[WorkOrderID][int] NOT NULL FOREIGN KEY REFERENCES WorkOrder(WorkOrderID),
[CellTask] [nvarchar](20),
[TimeCompleted] [DateTime]
)
INSERT INTO DataCollection VALUES(1,1,'cella','2016-08-09 00:00:00');
INSERT INTO DataCollection VALUES(2,1,'cellb','2016-08-10 00:00:00');
INSERT INTO DataCollection VALUES(3,1,'cellc','2016-08-11 00:00:00');
INSERT INTO DataCollection VALUES(4,2,'cella','2016-08-09 00:00:00');
INSERT INTO DataCollection VALUES(5,2,'cellb','2016-08-10 00:00:00');
CREATE TABLE [dbo].[TaskListing](
[TaskListingID][int] NOT NULL PRIMARY KEY,
[WorkOrderID][int] NOT NULL FOREIGN KEY REFERENCES WorkOrder(WorkOrderID),
[Op][nvarchar](20) NOT NULL,
[preOP][nvarchar](20),
[CellTask][nvarchar](20) NOT NULL,
[Completed][bit] NOT NULL
)
INSERT INTO TaskListing VALUES(1,1,'10',NULL,'cella',0);
INSERT INTO TaskListing VALUES(2,1,'20','10','cellb',0);
INSERT INTO TaskListing VALUES(3,1,'30',NULL,'cellc',1);
INSERT INTO TaskListing VALUES(4,1,'40','10,30','celld',0);
INSERT INTO TaskListing VALUES(5,2,'10',NULL,'cella',1);
INSERT INTO TaskListing VALUES(6,2,'20','10','cellb',1);
INSERT INTO TaskListing VALUES(7,2,'30','20','cellc',0);
The output table should show, for each WorkOrder, the most recently completed cell (from the DataCollection table's TimeCompleted column) and the next cell in the work flow (found by searching the TaskListing rows for the given WorkOrderID for a row that contains the completed task as a PreOp). If the completed task doesn't appear as a PreOp on any other row, it should default to the last task.
The part of the query I'm having the most trouble with is filling in the NextTaskCell column. I need to look at all the tasks for a given WorkOrderID (in the TaskListing table) and, based on the completed task, determine which task is next. I'm finding it difficult to feed in both a WorkOrderID and a CellTask and then find that task in the PreOp column of another row.
Output Table
+-------------+-------------------+---------------------+--------------+
| WorkOrderId | LastCompletedCell | CompletedOn | NextTaskCell |
|(WorkOrder) | (DataCollection) | (DataCollection) |(TaskListing) |
+-------------+-------------------+---------------------+--------------+
| 1 | cellc | 2016-08-11 00:00:00 | celld |
| 2 | cellb | 2016-08-10 00:00:00 | cellc |
+-------------+-------------------+---------------------+--------------+
Thank you in advance for your time. If you have any other questions, please let me know and I'll try to answer them.
Link to SQL Fiddle

The following query gives you the expected output you have in your question. You should test this query against a larger dataset to make sure it is correct in all cases.
;WITH
mtc AS ( -- most recent completion date/time for a work order
SELECT
dc.WorkOrderID,
TimeCompleted=MAX(dc.TimeCompleted)
FROM
DataCollection AS dc
GROUP BY
dc.WorkOrderID
),
lop AS ( -- last operation for work order
SELECT
tl.WorkOrderID,
LastOp=MAX(CAST(tl.Op AS INT))
FROM
TaskListing AS tl
GROUP BY
tl.WorkOrderID
)
SELECT
mtc.WorkOrderID,
LastCompletedCell=dc.CellTask,
CompletedOn=dc.TimeCompleted,
NextTaskCell=ISNULL(tl_next.CellTask,tl_last.CellTask)
FROM
mtc
INNER JOIN DataCollection AS dc ON -- the last completed CellTask
dc.WorkOrderID=mtc.WorkOrderID AND
dc.TimeCompleted=mtc.TimeCompleted
INNER JOIN TaskListing AS tl ON -- Op for CellTask
tl.WorkOrderID=mtc.WorkOrderID AND
tl.CellTask=dc.CellTask
INNER JOIN lop ON
lop.WorkOrderID=mtc.WorkOrderID
INNER JOIN TaskListing AS tl_last ON -- CellTask for last Op
tl_last.WorkOrderID=mtc.WorkOrderID AND
tl_last.Op=lop.LastOp
LEFT JOIN TaskListing AS tl_next ON -- Look for next CellTask where Op is a PreOp of another CellTask
tl_next.WorkOrderID=mtc.WorkOrderID AND
','+tl_next.preOP+',' LIKE '%,'+tl.Op+',%'
ORDER BY
mtc.WorkOrderId;
Note: It is a bad idea to store PreOps as a comma-separated string. This is not how data should be stored in a relational database; when you do this, you have to resort to more complex and less efficient queries. To wit, see the join condition on tl_next.
Instead, you should have a table that stores each PreOp as a separate row, linked to the Op that depends on it.
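A minimal sketch of such a dependency table (the table and column names here are illustrative, not part of the original schema):

```sql
CREATE TABLE [dbo].[TaskDependency](
    [TaskListingID] [int] NOT NULL
        FOREIGN KEY REFERENCES TaskListing(TaskListingID), -- the dependent task
    [PreOpTaskListingID] [int] NOT NULL
        FOREIGN KEY REFERENCES TaskListing(TaskListingID), -- its prerequisite
    PRIMARY KEY (TaskListingID, PreOpTaskListingID)
);

-- TaskListing row 4 above (Op '40' with preOP '10,30') would become two rows,
-- pointing at the rows holding Op 10 (TaskListingID 1) and Op 30 (TaskListingID 3):
INSERT INTO TaskDependency VALUES (4, 1);
INSERT INTO TaskDependency VALUES (4, 3);
```

With that structure, the string-matching LIKE condition in the tl_next join collapses into a plain equality join against TaskDependency, which the optimizer can use indexes for.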

Related

How to add data to a single column

I have a question regarding adding data to a particular column of a table. I had a post yesterday where a user guided me (thanks for that) and said an UPDATE was the way to go for what I need, but I still can't achieve my goal.
I have two tables: the table the information will be added from, and the table the information will be added to. Here is an example:
source_table (has only a column called "name_expedient_reviser" that is nvarchar(50))
name_expedient_reviser
kim
randy
phil
cathy
josh
etc.
On the other hand I have the destination table. This one has two columns: one with the IDs, and another where the names will be inserted. That column's values are currently NULL, and there are some IDs that are going to be used for this.
This is how the other table looks:
dbo_expedient_reviser (has 2 columns: unique_reviser_code, a numeric PK that is not auto-incrementing, and name_expedient_reviser, nvarchar(50), holding the users who check expedients). This is the table's current state:
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | NULL
2 | NULL
3 | NULL
4 | NULL
5 | NULL
6 | NULL
What I need is for the information from source_table to be inserted into the name_expedient_reviser column, so the result should look like this:
dbo_expedient_reviser
unique_reviser_code | name_expedient_reviser
1 | kim
2 | randy
3 | phil
4 | cathy
5 | josh
6 | etc.
How can I pass the information into this table? What do I have to do?
EDIT
The query I saw that should have worked doesn't update anything. It is this one:
UPDATE dbo_expedient_reviser
SET dbo_expedient_reviser.name_expedient_reviser = source_table.name_expedient_reviser
FROM source_table
JOIN dbo_expedient_reviser ON source_table.name_expedient_reviser = dbo_expedient_reviser.name_expedient_reviser
WHERE dbo_expedient_reviser.name_expedient_reviser IS NULL
The query was supposed to update the table, extracting the information from source_table as long as name_expedient_reviser is NULL (which it is), but it doesn't work.
Since the names do not have an ID associated with them, I would just use ROW_NUMBER and join on ROW_NUMBER = unique_reviser_code. The only problem is knowing which rows are NULL. From what I see, they all appear NULL. Is that the case in your data, or are there names scattered through the table (rows 5, 17, 29, etc.)? If name_expedient_reviser is empty throughout dbo_expedient_reviser, you could also truncate the table and insert the values directly. Hopefully that unique_reviser_code isn't already linked to other things.
WITH CTE (name_expedient_reviser, unique_reviser_code)
AS
(
SELECT name_expedient_reviser
,ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
FROM source_table
)
UPDATE er
SET er.name_expedient_reviser = cte.name_expedient_reviser
FROM dbo_expedient_reviser er
JOIN CTE on cte.unique_reviser_code = er.unique_reviser_code
Or Truncate:
Truncate Table dbo_expedient_reviser
INSERT INTO dbo_expedient_reviser (name_expedient_reviser, unique_reviser_code)
SELECT DISTINCT
unique_reviser_code = ROW_NUMBER() OVER (ORDER BY name_expedient_reviser)
,name_expedient_reviser
FROM source_table
It is not possible to INSERT data into a single column of existing rows; an UPDATE that moves the data you want is the way to go in such cases.

SQL unique PK for grouped data in SP

I am trying to build a temp table with grouped data from multiple tables (in a stored procedure). I am successful in building the data set; however, I have a requirement that each grouped row have a unique ID. I know there are ways to generate unique IDs for each row, but the problem is that I need the ID for a given row to be the same on each run, regardless of the number of rows returned.
Example:
1st run:
ID Column A Column B
1 apple 15
2 orange 10
3 grape 11
2nd run:
ID Column A Column B
3 grape 11
The reason I want this is that I am sending this data up to SOLR, and when I do a delta I need to have the same ID back for the same row as it re-indexes.
Any way I can do this?
Not sure if this will help, as I'm not entirely confident of your wider picture, but ...
As your new data is assembled, log each [column a] value in a table of your own.
Give that table an IDENTITY column to do the numbering for you.
Now you can join any new data sets to your lookup table and you'll have a persistent number for each column A.
You just need to ensure that each time you query new data, you add new values to the lookup table.
create table dbo.myRef(
idx int identity(1,1)
,[A] nvarchar(100)
)
A general draft is below ...
--- just simulating some input data here
with cte as (
select 'apple' as [A], 15 as [B]
UNION
select 'orange' as [A], 10 as [B]
UNION
select 'banana' as [A], 4 as [B]
)
select * into #temp from cte;
-- Put any new values into the lookup table
-- and they will be assigned a new index number by the identity column
insert into dbo.myRef([A])
select distinct [A]
from #temp where [A] not in (select [A] from dbo.myRef)
-- now pull your original data for output, joining to the lookup table to get a ref number.
select T.*,R.idx
from #temp T
inner join
dbo.myRef R
on T.[A] = R.[A]
Sorry for the late reply, I was stuck with something else; however, I solved my own issue.
I built two temp tables: one with all the data from the various tables (#master), and another (#final) to house all the grouped data, with an empty column for the ID.
Next I did a CONCAT(column1, '-', column2, '-', column3) on three columns from #master and updated the #final table based on the type.
This gives me the same concatenated IDs on each run.
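The concat approach described above might look something like this sketch; the #master/#final column names and the join key are assumptions based on the description, not code from the original post:

```sql
-- Build a deterministic ID by concatenating stable grouping columns,
-- so the same logical row receives the same ID on every run.
-- (CONCAT requires SQL Server 2012 or later.)
UPDATE f
SET f.ID = CONCAT(m.column1, '-', m.column2, '-', m.column3)
FROM #final f
INNER JOIN #master m
    ON f.[type] = m.[type]; -- join key assumed from "based on the type"
```

The key property is that the ID is derived purely from the row's own data, so it is stable across runs, unlike ROW_NUMBER or IDENTITY values, which depend on how many rows happen to be returned.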

How to shift entire row from last to 3rd position without changing values in SQL Server

This is my table:
DocumentTypeId DocumentType UserId CreatedDtm
--------------------------------------------------------------------------
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.4170000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.7000000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.4730000
40f8ff9f Word 79b399715 1994-04-23 10:33:44.2300000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.1700000
How can I shift the last entire row (DocumentType=email) on 3rd position,(before DocumentType=XL) without changing table values?
Without wishing to deny the truth of what others have said here, SQL Server does have CLUSTERED indices. For full details on these and the difference between a clustered table and a non-clustered one, please see here. In effect, a clustered table does have data written to disk in index order. However, due to subsequent insertions and deletions, you should never rely on any given record being in a fixed ordinal position.
To get your data showing email third and XL fourth, you simply need to order by CreatedDtm. Thus:
declare #test table
(
DocumentTypeID varchar(20),
DocumentType varchar(10),
UserID varchar(20),
CreatedDtm datetime
)
INSERT INTO #test VALUES
('2d47e2f8-4','PDF','443f-4baa','2015-12-03 17:56:59'),
('b4b-4803-a','Images','a99f-1fd','1997-02-11 22:16:51'),
('600-0e32','XL','e60e07a6b','2015-08-19 15:26:11'),
('40f8ff9f','Word','79b399715','1994-04-23 10:33:44'),
('8230a07c','email','750e-4c3d','2015-01-10 09:56:08')
SELECT * FROM #test order by CreatedDtm
This gives a result set of:
40f8ff9f Word 79b399715 1994-04-23 10:33:44.000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.000
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.000
This maybe what you are looking for, but I cannot stress enough, that it only gives email 3rd and XL 4th in this particular case. If the dates were different, it would not be so. But perhaps, this was all that you needed?
I assumed that you need to sort by the DocumentType column.
By joining to a derived table that maps each DocumentType to a desired SortOrder, you can achieve the result you want.
declare #tbl table(
DocumentTypeID varchar(50),
DocumentType varchar(50)
)
insert into #tbl(DocumentTypeID, DocumentType)
values
('2d47e2f8-4','PDF'),
('b4b-4803-a','Images'),
('600-0e32','XL'),
('40f8ff9f','Word'),
('8230a07c','email')
;
--this will give you original output
select * from #tbl;
--this will output rows with new sort order
select t.* from #tbl t
inner join
(
select *
from
(values
('PDF',1, 1),
('Images',2, 2),
('XL',3, 4),
('Word',4, 5),
('email',5, 3) --here I put new sort order '3'
) as dt(TypeName, SortOrder, NewSortOrder)
) dt
on dt.TypeName = t.DocumentType
order by dt.NewSortOrder
The row positions don't really matter in SQL tables, since a table is an unordered set of rows, but if you really want to switch the rows I'd suggest you copy all your data to a temp table, e.g.:
SELECT * INTO #temptable FROM [tablename]
Then delete/truncate the data from the original table (if that won't mess up other tables it's connected to) and insert from the temp table in whatever order you like, since it has all the same fields with the same data as the original.
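Putting those steps together, the round trip might look like this sketch (table and column names assumed from the question; note that in T-SQL the INTO clause comes before FROM):

```sql
SELECT * INTO #temptable FROM dbo.Documents;  -- copy everything out
TRUNCATE TABLE dbo.Documents;                 -- empty the original table

INSERT INTO dbo.Documents                     -- reinsert in the order you prefer
SELECT * FROM #temptable
ORDER BY CreatedDtm;                          -- only affects insertion order

DROP TABLE #temptable;
```

Be aware that the ORDER BY here controls only the order of insertion, not any guaranteed storage or retrieval order; as the other answers note, only an ORDER BY on the final SELECT guarantees result order.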

Delete duplicates from large dataset (>100Mio rows)

I know that this topic came up many times before here but none of the suggested solutions worked for my dataset because my laptop stopped calculating due to memory issues or full storage.
My table looks like the following and has 108 Mio rows:
Col1 |Col2 | Col3 |Col4 |SICComb | NameComb
Case New |3523 | Alexander |6799 |67993523| AlexanderCase New
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
Undisclosed|6799 | Case New |3523 |67993523| Case NewUndisclosed
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
SmartCard |3674 | Virtual NetComm|7373 |73733674| SmartCardVirtual NetComm
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
The unique columns are SICComb and NameComb. I tried to add a primary key with:
ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)
but the new integer column filled up more than 30 GB of my storage in just a few minutes.
Which would be the fastest and most efficient method to delete the duplicates from the table?
If you're using SQL Server, you can delete from a common table expression:
with cte as (
select row_number() over(partition by SICComb, NameComb order by Col1) as row_num
from Table1
)
delete
from cte
where row_num > 1
Here all rows are numbered; you get a separate sequence for each unique combination of SICComb + NameComb. You can choose which row survives by changing the ORDER BY inside the OVER clause.
In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.
Here is the idea, using SQL Server syntax:
select distinct t.*
into #temptable
from t;
truncate table t;
insert into t
select tt.*
from #temptable tt;
Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.
Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table as rather than select into.
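For example, the same copy-out step might be written differently per dialect (a sketch; table names are placeholders):

```sql
-- SQL Server: SELECT ... INTO creates the target table implicitly
SELECT DISTINCT * INTO #temptable FROM t;

-- PostgreSQL / MySQL and others: CREATE TABLE ... AS instead
-- CREATE TEMPORARY TABLE temptable AS SELECT DISTINCT * FROM t;
```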
EDIT:
Your identity insert error is troublesome. I think you need to remove the identity column from the column list in the DISTINCT. Or do:
select min(<identity col>), <all other columns>
from t
group by <all other columns>
If you have an identity column, then there are no duplicates (by definition).
In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:
insert into t(<all other columns>)
select <all other columns>;
If you need the old identity value (and the minimum will do), enable IDENTITY_INSERT on the target table and do:
insert into t(<all columns including identity>)
select <all columns including identity>;
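For instance, keeping the old IDs in the deduplication example above might look like this (a sketch, assuming the ID identity column added earlier and the column names from the question):

```sql
-- IDENTITY_INSERT must be ON to supply explicit values for an identity column;
-- only one table per session can have it ON at a time.
SET IDENTITY_INSERT dbo.test ON;

INSERT INTO dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
SELECT ID, Col1, Col2, Col3, Col4, SICComb, NameComb
FROM #temptable;

SET IDENTITY_INSERT dbo.test OFF;
```

Note that with IDENTITY_INSERT the column list must be stated explicitly in both the INSERT and the SELECT; `INSERT INTO t SELECT *` will fail.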

Computed column expression

I have a specific need for a computed column called ProductCode
ProductId | SellerId | ProductCode
1 1 000001
2 1 000002
3 2 000001
4 1 000003
ProductId is identity, increments by 1.
SellerId is a foreign key.
So my computed column ProductCode must look at how many products the Seller has, and be in the format 000000. The problem is knowing which Seller's products to count.
I've written T-SQL that does not take into account how many products a seller has:
ALTER TABLE dbo.Product
ADD ProductCode AS RIGHT('000000' + CAST(ProductId AS VARCHAR(6)) , 6) PERSISTED
You cannot have a computed column based on data outside of the current row that is being updated. The best you can do to make this automatic is to create an after-trigger that queries the entire table to find the next value for the product code. But in order to make this work you'd have to use an exclusive table lock, which will utterly destroy concurrency, so it's not a good idea.
I also don't recommend using a view because it would have to calculate the ProductCode every time you read the table. This would be a huge performance-killer as well. By not saving the value in the database never to be touched again, your product codes would be subject to spurious changes (as in the case of perhaps deleting an erroneously-entered and never-used product).
Here's what I recommend instead. Create a new table:
dbo.SellerProductCode
SellerID LastProductCode
-------- ---------------
1 3
2 1
This table reliably records the last-used product code for each seller. On INSERT to your Product table, a trigger will update the LastProductCode in this table appropriately for all affected SellerIDs, and then update all the newly-inserted rows in the Product table with appropriate values. It might look something like the below.
See this trigger working in a Sql Fiddle
CREATE TRIGGER TR_Product_I ON dbo.Product FOR INSERT
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE #LastProductCode TABLE (
SellerID int NOT NULL PRIMARY KEY CLUSTERED,
LastProductCode int NOT NULL
);
WITH ItemCounts AS (
SELECT
I.SellerID,
ItemCount = Count(*)
FROM
Inserted I
GROUP BY
I.SellerID
)
MERGE dbo.SellerProductCode C
USING ItemCounts I
ON C.SellerID = I.SellerID
WHEN NOT MATCHED BY TARGET THEN
INSERT (SellerID, LastProductCode)
VALUES (I.SellerID, I.ItemCount)
WHEN MATCHED THEN
UPDATE SET C.LastProductCode = C.LastProductCode + I.ItemCount
OUTPUT
Inserted.SellerID,
Inserted.LastProductCode
INTO #LastProductCode;
WITH P AS (
SELECT
NewProductCode =
L.LastProductCode + 1
- Row_Number() OVER (PARTITION BY I.SellerID ORDER BY P.ProductID DESC),
P.*
FROM
Inserted I
INNER JOIN dbo.Product P
ON I.ProductID = P.ProductID
INNER JOIN #LastProductCode L
ON P.SellerID = L.SellerID
)
UPDATE P
SET P.ProductCode = Right('00000' + Convert(varchar(6), P.NewProductCode), 6);
Note that this trigger works even if multiple rows are inserted. There is no need to preload the SellerProductCode table, either--new sellers will automatically be added. This will handle concurrency with few problems. If concurrency problems are encountered, proper locking hints can be added without deleterious effect as the table will remain very small and ROWLOCK can be used (except for the INSERT which will require a range lock).
Please do see the Sql Fiddle for working, tested code demonstrating the technique. Now you have real product codes that have no reason to ever change and will be reliable.
I would normally recommend using a view to do this type of calculation. The view could even be indexed if select performance is the most important factor (I see you're using persisted).
You cannot have a subquery in a computed column, which essentially means that you can only access the data in the current row. The only ways to get this count would be to use a user-defined function in your computed column, or triggers to update a non-computed column.
A view might look like the following:
create view ProductCodes as
select p.ProductId, p.SellerId,
(
select right('000000' + cast(count(*) as varchar(6)), 6)
from Product
where SellerID = p.SellerID
and ProductID <= p.ProductID
) as ProductCode
from Product p
One big caveat to your product numbering scheme, and a downfall for both the view and UDF options, is that we're relying upon a count of rows with a lower ProductId. This means that if a Product is inserted in the middle of the sequence, it would actually change the ProductCodes of existing Products with a higher ProductId. At that point, you must either:
Guarantee the sequencing of ProductId (identity alone does not do this)
Rely upon a different column that has a guaranteed sequence (still dubious, but maybe CreateDate?)
Use a trigger to get a count at insert which is then never changed.
