How to avoid using a cursor in this T-SQL example - sql-server

I'm converting a VBA application to run natively in a set of Microsoft SQL Server stored procedures. Since I'm using recordsets in VBA the direct translation would be to use a cursor in SQL Server. Unlike some other databases, SQL Server has terrible performance with cursors and I see advice to avoid them like the plague. So I'm looking for advice on a direct way to code this problem.
I have a set of tasks and a set of people to work those tasks. There are rules so that only certain people can work certain tasks. The goal is to distribute the tasks as evenly as possible to the people.
The VBA algorithm is:
1. Select all the tasks for a given rule.
2. Select the single person who matches that rule and has the lowest number of tasks assigned to them, requerying each time to ensure that I get the person with the lowest number after each update.
3. Assign that person to that task.
4. Increment the person's count of tasks assigned.
5. Next Task
6. If not end of Task goto 2.
The only ways I see to do this are with a cursor or a real ugly while loop. With a cursor the logic moves over with the same steps from the VBA application.
The WHILE Loop I envision in pseudo code:
While exists unassigned tasks
    select a single task
    select the person with the least tasks that can work that task type
    assign person to task
    increment person's count
Does anyone have any better suggestions?

Not everything should be done in a stored procedure. This sounds like a scenario where you might select all the datasets, process assignment on client side (or SQLCLR), then push back the results.
Unless the problem is so massive to prevent it (millions of people, tasks, and rules), keeping it outside of the database should keep it more maintainable.
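As a rough illustration of what the client-side processing could look like, here is a minimal Python sketch of the greedy loop from the question. The function name, the shape of the inputs (tasks as (task_id, rule) pairs, a rule-to-people mapping, current counts per person) and the tie-break on the lowest person id are all assumptions for the example, not something the question specifies.

```python
from collections import Counter

def assign_tasks(tasks, eligible, current_counts):
    """Greedy assignment: each task goes to the eligible person with the fewest tasks.

    tasks          -- list of (task_id, rule) pairs (hypothetical shape)
    eligible       -- dict mapping rule -> list of person ids allowed to work it
    current_counts -- dict mapping person id -> tasks already assigned
    """
    counts = Counter(current_counts)  # missing people count as 0
    assignments = {}
    for task_id, rule in tasks:
        candidates = eligible.get(rule, [])
        if not candidates:
            continue  # nobody may work this task; leave it unassigned
        # Lowest count wins; ties go to the lowest person id (an assumption).
        person = min(candidates, key=lambda p: (counts[p], p))
        assignments[task_id] = person
        counts[person] += 1
    return assignments
```

The resulting task-to-person mapping can then be pushed back to the database in a single batch insert.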

I'm new here but have been writing SQL/stored procedures like this for a few years.
Anyway, I've found that there are some times you cannot avoid a cursor (or while loop, etc) - particularly when the results of the next iteration depend on the results of this iteration. This applies in your case, as you're assigning tasks based on how many tasks they have.
The good news is that, based on what you're doing (assigning people to tasks) I'm guessing you're only talking very small numbers of loops and/or only running infrequently. As such, a cursor should still work fine.
The key thing to be mindful of in this is to keep the code within the cursor as simple as possible and ensure you don't need to keep doing big reads/etc. Instead, do any harder coding outside the cursor to get the data in a reasonable format (e.g., in temp tables) then run the cursor over the top.

Since you did not give us the structure of your tables, it's hard to give you a solution that will suit your needs exactly. This is only an example of how you can rethink your logic; it's up to you to adapt it, especially based on the volume of your dataset (which I assume must be small), and to fill the gap for how you apply the rules about which person can be assigned to which task.
But, here's a strategy you can use (only if you understand it, since you'll have to maintain it!).
Let's say you have these tables:
create table Task (id int identity(1,1) not null, taskname varchar(30));
create table Person (id int identity(1,1), name varchar(30) not null);
create table AssignedTask (id int identity(1,1) not null, idPerson int not null, idTask int not null)
with these values:
insert into Task (taskname) values ('Task1'),('Task2'),('Task3'), ('Task4'), ('Task5'), ('Task6'),('Task7'),('Task8'),('Task9'),('Task10')
insert into Person (name) values ('Person1'),('Person2'), ('Person3'), ('Person4')
insert into AssignedTask (idPerson, idTask) values (1,3),(2,4),(4,1),(4,5)
The current situation is :
select p.id, p.name, 'is assigned to', t.id as TaskId, t.taskname
from AssignedTask at
inner join Person p on p.id = at.idPerson
inner join Task t on t.id = at.idTask
order by p.name
id  name                     TaskId  taskname
--  -------  --------------  ------  --------
1   Person1  is assigned to  3       Task3
2   Person2  is assigned to  4       Task4
4   Person4  is assigned to  1       Task1
4   Person4  is assigned to  5       Task5
The idea behind this logic is to "pre-compute" the order in which the tasks will be assigned: for each person, take the number of tasks already assigned plus a row number, then sort the rows from the lowest value to the highest.
-- unassigned task with a TaskSequence to be able to perform an join on a row number by NbAssignedTaskIfAssigned
;with UnassignedTask as (
select row_number() over (order by id) as TaskSeq, id
from task
where id not in (select idTask from AssignedTask)
)
-- Get Number of task assigned by person
, PersonTask as (
select idPerson, count(1) as NbAssignedTask
from AssignedTask
group by idPerson
)
-- this is where the logic is
, ExpandThis as (
select p.id as PersonId, coalesce(pt.nbAssignedTask,0) as nbAssignedTask
-- this allow us to prioritize which person should be assigned a task
-- partition by p.id rather than pt.idPerson, which is NULL for every person without any task yet
, coalesce(pt.nbAssignedTask,0) + row_number() over (partition by p.id order by x.TaskSeq) as NbAssignedTaskIfAssigned
from Person p
left join PersonTask pt on pt.idPerson = p.id
-- this is where the magic happens. Since you did not provides us how
-- the rule for assignment is enforce (either in a table on in code),
-- i choose to not apply any rule at all, so every person can be assigned
-- to any task. Change this line to a "cross apply ( select .... from
-- MyRules where ... ) x" to fit you needs.
cross apply UnassignedTask x
)
, Prioritize as (
select *
-- Just use a basic row_number to order which row must be assigned first
, ROW_NUMBER() over (order by NbAssignedTaskIfAssigned) as [Priority]
from ExpandThis et
)
select p.PersonId, ps.name, t.id as TaskId, t.taskname
from Prioritize p
inner join UnassignedTask ut on ut.TaskSeq = p.[Priority]
inner join Person ps on ps.id = p.PersonId
inner join Task t on t.id = ut.id
order by p.[Priority]
and the result is :
PersonId  name     TaskId  taskname
--------  -------  ------  --------
3         Person3  2       Task2
3         Person3  6       Task6
1         Person1  7       Task7
2         Person2  8       Task8
2         Person2  9       Task9
4         Person4  10      Task10
Person 3 did not have any assigned task, so he comes first. The second, third and fourth rows are tied (assuming task 2 is assigned to Person3), so they get assigned the next 3 tasks. Finally, all four persons now have an equal number of tasks assigned but only 2 tasks remain, so the first 2 persons get them. You can adjust how ties are broken by modifying the order by clause.
Now, you can insert this result into AssignedTask, no need of a cursor nor a loop to perform this task.
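To make the "count + row number" idea easier to follow outside of SQL, here is a small Python sketch of the same pre-computation. It is only an illustration: the function name is made up, no assignment rules are applied (every person can take any task, as in the query above), and ties are broken by person id, which is an assumption and can give a different but equally balanced result than the one shown above.

```python
def precompute_assignment(unassigned_tasks, counts):
    """Pair each unassigned task with a person using 'count + row number' ordering.

    unassigned_tasks -- task ids in the order they should be handed out
    counts           -- dict mapping person id -> number of tasks already assigned
    """
    n = len(unassigned_tasks)
    # One candidate slot per (person, k): the value is the person's task count
    # if they were given their k-th new task. Sorting puts the least loaded first.
    slots = sorted(
        (count + k, person)
        for person, count in counts.items()
        for k in range(1, n + 1)
    )
    # The i-th task goes to the person owning the i-th smallest slot.
    return [(person, task) for (_, person), task in zip(slots, unassigned_tasks)]
```

With the sample data (Person1 and Person2 have one task each, Person3 none, Person4 two), calling precompute_assignment([2, 6, 7, 8, 9, 10], {1: 1, 2: 1, 3: 0, 4: 2}) leaves everyone with two or three tasks.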

Related

SQL Server: how to do an efficient cross join

My data is structured as follows:
create table data (id int, cluster int, weight float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);
Then I have to impute some values because the vector is of a certain length x:
declare @startnum int=0
declare @endnum int=4036;
with gen as (select @startnum as num
    union ALL
    select num+1 from gen where num+1<=@endnum)
select * into gen from gen -- store the numbers
option(maxrecursion 10000)
I then have to cross join the values stored in gen but this is done on two very large tables (not as in the current example), currently my query is running for over 2 hours and I start to think there is something wrong. Any ideas on how I can make this procedure faster and more correct?
Here's what I'm doing right now.
select id, cluster, max(v) as weight
from (select id, cluster, case when cluster=num then weight else 0 end as v
from data
cross join gen) cross_num
group by id, cluster;
go
EDIT: It is the last query that is running very slowly, and of course I have a super large dataset :)
Note: I also wonder what the Paste the Plan is exactly, I actually don't know how to look for this, can someone give me a resource I can look up and try to understand it?
So, the problem here is that you're creating a massive Cartesian product and aggregating at the same time.
However, we might be able to cheat if your data lines up well. This may also totally backfire if it lines up poorly. I can't see your data so I don't know what's going on.
I'm going to write this using an empty staging table or temp table. You could write it as a series of CTE expressions, but in my experience those do not perform quite as well. Ideally you can take these queries and wrap them in a stored procedure.
So, the primary key for your table can't be id, cluster, because you're aggregating on that group. If id, cluster is not very selective -- meaning that there are a lot of records for each id, cluster combination -- then we might be able to significantly reduce the amount of work done here. If there's 5 records for each id, cluster, then this will probably not help much but if there's 100,000 for each id, cluster then it will probably help a lot.
First, create your gen table. I recommend creating a clustered primary key on gen.num.
Second, let's start building the data. Remember, I'm assuming StagingTable is empty.
Here's the first query that does the real work:
INSERT INTO StagingTable (id, cluster, weight)
SELECT id, cluster, MAX(weight) AS weight
FROM data
GROUP BY id, cluster
The query would benefit from an index, but it will depend on your data if id, cluster, weight is better or worse than cluster, id, weight. However, before you run this you should disable any indexes on StagingTable and then rebuild the index after running at least this first insert.
Depending on your data, you may require or benefit from or should avoid using a WHERE cluster BETWEEN 0 AND 4036 clause on the above query as well. It's not clear to me if there are 4037 clusters numbered 0 to 4036, or if you're only interested in clusters 0 to 4036 but there are more, or if you're only interested in creating "default" records of weight 0 for clusters 0 to 4036 but want all clusters aggregated if they happen to go higher.
Now, think about what's in StagingTable. Everything that we've loaded into that table is everywhere that there is an id, cluster in the data table. Critically, every id we might need will be in StagingTable, even if it's missing one or more values for cluster.
Now we just need to fill in the missing cluster values for each id, and we know that the weight of the missing clusters is 0.
INSERT INTO StagingTable (id, cluster, weight)
SELECT DISTINCT s.id, g.num, 0
FROM StagingTable s
INNER JOIN gen g
ON g.num BETWEEN 0 AND 4036
WHERE NOT EXISTS (
SELECT 1
FROM StagingTable s2
WHERE s2.id = s.id
AND s2.cluster = g.num
)
The INNER JOIN gen g ON g.num BETWEEN 0 AND 4036 may not be necessary if gen is always going to be numbers 0 to 4036. In that case you can just use CROSS JOIN gen g.
The NOT EXISTS is necessary so that rows already present in StagingTable are not inserted a second time.
Again, this query could benefit from an index on StagingTable, but without seeing your actual data it's a little difficult to tell exactly what you need. (id, cluster) is one possibility, but (cluster, id) may actually work better. Ideally, it should be a clustered primary key.
Edit: Just realized my original second query wouldn't work in some cases. I've modified it to correct the logic.
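The two-step approach above (aggregate first, then fill in only the missing clusters) can be tried on a small scale with the following Python sketch. It runs against an in-memory SQLite database rather than SQL Server, so it only demonstrates the logic, not the performance; the function name and the staging table definition are assumptions made for the example.

```python
import sqlite3

def fill_missing_clusters(rows, max_cluster):
    """Aggregate (id, cluster, weight) rows, then add weight 0 for missing clusters."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (id INT, cluster INT, weight REAL)")
    con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    con.execute("CREATE TABLE gen (num INT PRIMARY KEY)")
    con.executemany("INSERT INTO gen VALUES (?)",
                    [(n,) for n in range(max_cluster + 1)])
    con.execute("CREATE TABLE staging (id INT, cluster INT, weight REAL, "
                "PRIMARY KEY (id, cluster))")
    # Step 1: one row per existing (id, cluster) with its maximum weight.
    con.execute("INSERT INTO staging "
                "SELECT id, cluster, MAX(weight) FROM data GROUP BY id, cluster")
    # Step 2: for every id, add the clusters that are not there yet with weight 0.
    con.execute("""
        INSERT INTO staging (id, cluster, weight)
        SELECT ids.id, g.num, 0
        FROM (SELECT DISTINCT id FROM staging) AS ids
        CROSS JOIN gen AS g
        WHERE NOT EXISTS (
            SELECT 1 FROM staging AS s2
            WHERE s2.id = ids.id AND s2.cluster = g.num
        )
    """)
    return con.execute(
        "SELECT id, cluster, weight FROM staging ORDER BY id, cluster").fetchall()
```

The cross join still happens in step 2, but only over the distinct ids, and the aggregation is no longer done over the full Cartesian product.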

SQL Server - Update All Records, Per Group, With Result of SubQuery

If anyone could even just help me phrase this question better I'd appreciate it.
I have a SQL Server table, let's call it cars, which contains entries representing items and information about their owners including car_id, owner_accountNumber, owner_numCars.
We're using a system that sorts 'importantness of owner' based on number of cars owned, and relies on the owner_numCars column to do so. I'd rather not adjust this, if reasonably possible.
Is there a way I can update owner_numCars per owner_accountNumber using a stored procedure? Maybe some other more efficient way I can accomplish every owner_numCars containing the count of entries per owner_accountNumber?
Right now the only way I can think to do this is to (from the c# application):
SELECT owner_accountNumber, COUNT(*)
FROM mytable
GROUP BY owner_accountNumber;
and then foreach row returned by that query
UPDATE mytable
SET owner_numCars = <count result>
WHERE owner_accountNumber = <accountNumber result>
But this seems wildly inefficient compared to having the server handle the logic and updates.
Edit - Thanks for all the help. I know this isn't really a well set up database, but it's what I have to work with. I appreciate everyone's input and advice.
This solution takes into account that you want to keep the owner_numCars column in the CARs table and that the column should always be accurate in real time.
I'm defining table CARS as a table with attributes about cars including its current owner. The number of cars owned by the current owner is de-normalized into this table. Say I, LAS, own three cars, then there are three entries in table CARS, as such:
car_id owner_accountNumber owner_numCars
1 LAS1 3
2 LAS1 3
3 LAS1 3
For owner_numCars to be used as an importance factor in a live interface, you'd need to update owner_numCars for every car every time LAS1 sells or buys a car or is removed from or added to a row.
Note you need to update CARS for both the old and new owners. If Sam buys car1, both Sam's and LAS' totals need to be updated.
You can use this procedure to update the rows. This SP is very context sensitive. It needs to be called after rows have been deleted or inserted for the deleted or inserted owner. When an owner is updated, it needs to be called for both the old and new owners.
To update real time as accounts change owners:
create procedure update_car_count
@p_acct nvarchar(50) -- use your actual datatype here
AS
update CARS
set owner_numCars = (select count(*) from CARS where owner_accountNumber = @p_acct)
where owner_accountNumber = @p_acct;
GO
To update all account_owners:
create procedure update_car_count_all
AS
update C
set owner_numCars = (select count(*) from CARS where owner_accountNumber = C.owner_accountNumber)
from CARS C
GO
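If you want to check the "update all owners" logic without touching a real server, the same correlated-subquery update can be run against an in-memory SQLite database from Python. This is only a sketch: the table layout follows the example above, the helper names are made up, and SQLite is standing in for SQL Server.

```python
import sqlite3

def make_cars(rows):
    """Create an in-memory cars table from (car_id, owner_accountNumber) pairs."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cars (car_id INTEGER PRIMARY KEY, "
                "owner_accountNumber TEXT, owner_numCars INTEGER)")
    con.executemany(
        "INSERT INTO cars (car_id, owner_accountNumber) VALUES (?, ?)", rows)
    return con

def refresh_all_counts(con):
    """Set owner_numCars on every row to the number of rows for that owner."""
    con.execute("""
        UPDATE cars
        SET owner_numCars = (
            SELECT COUNT(*) FROM cars AS c
            WHERE c.owner_accountNumber = cars.owner_accountNumber
        )
    """)
```

One statement updates every row, so there is no need to loop over owners from the application.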
I think what you need is a View. If you don't know, a View is a virtual table that displays/calculates data from a real table and is continuously updated as the table data updates. So if you want to see your table with owner_numCars added you could do:
SELECT a.*, b.owner_numCars
from mytable as a
inner join
(SELECT owner_accountNumber, COUNT(*) as owner_numCars
FROM mytable
GROUP BY owner_accountNumber) as b
on a.owner_accountNumber = b.owner_accountNumber
You'd want to remove the owner_numCars column from the real table since you don't need to actually store that data on each row. If you can't remove it you can replace a.* with an explicit list of all the fields except owner_numCars.
You don't want to run SQL to update this value. What if it doesn't run for a long time? What if someone loads a lot of data and then runs the score and finds a guy that has 100 cars counts as a zero b/c the update didn't run. Data should only live in 1 place, updating has it living in 2. You want a view that pulls this value from the tables as it is needed.
CREATE VIEW vOwnersInfo
AS
SELECT o.*,
ISNULL(c.Cnt,0) AS Cnt
FROM OWNERS o
LEFT JOIN
(SELECT OwnerId,
COUNT(1) AS Cnt
FROM Cars
GROUP BY OwnerId) AS c
ON o.OwnerId = c.OwnerId
There are a lot of ways of doing this. Here is one way using the COUNT() OVER window function and an updatable Common Table Expression [CTE]. That way you won't have to worry about relating data back, ids etc.
;WITH cteCarCounts AS (
SELECT
owner_accountNumber
,owner_numCars
,NewNumberOfCars = COUNT(*) OVER (PARTITION BY owner_accountNumber)
FROM
MyTable
)
UPDATE cteCarCounts
SET owner_numCars = NewNumberOfCars
However, from a design perspective I would raise the question of whether this value (owner_numCars) should be on this table or on what I assume would be the owner table.
Rominus did make a good point of using a view if you want the data to always reflect the current value. You could also do it with a table valued function, which could be more performant than a view. But if you are simply showing it then you could simply do something like this:
SELECT
owner_accountNumber
,owner_numCars = COUNT(*) OVER (PARTITION BY owner_accountNumber)
FROM
MyTable
By adding a where clause to either the CTE or the SELECT statement you will effectively limit your dataset and the solution should remain fast. E.g.
WHERE owner_accountNumber = @owner_accountNumber

Field is being updated with same value

I have a table that has a new column, and I am updating the values that should go in the new column. For simplicity's sake I am reducing my example table structure as well as my query. Below is how I want my output to look.
IDNumber NewColumn
1 1
2 1
3 1
4 2
5 2
WITH CTE_Split
AS
(
select
*,ntile(2) over (order by newid()) as Split
from TableA
)
Update a
set NewColumn = a.Split
from CTE_Split a
Now when I do this I get my table and it looks as such
IDNumber NewColumn
1 1
2 1
3 1
4 1
5 1
However, when I do the select only I can see that I get the desired output. I have done this before to split result sets into multiple groups and everything works within the select, but now that I need to update the table I am getting this weird result. Not quite sure what I'm doing wrong or if anyone can provide any sort of feedback.
So after a whole day of frustration I was able to compare this code and table to another that I had already done this process to. The reason this table was getting updated to all 1s was that whoever made the table thought this column was supposed to be a bit flag, when in reality it should be an int: in this case there are only two possible values, but in others there could be more than two.
Thank you for all your suggestions and help. It should teach me to check the data types of tables when using the ntile function.
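For anyone hitting the same thing: SQL Server stores 1 in a bit column for any nonzero value, so both ntile groups collapse to 1. The tiny Python sketch below only imitates that conversion rule to show the effect; the function name is made up and NULL handling is left out.

```python
def store_as_bit(value):
    """Imitate SQL Server's int-to-bit conversion: zero stays 0, anything else becomes 1."""
    return 0 if value == 0 else 1

# The ntile(2) groups from the desired output, as they end up in a bit column.
ntile_groups = [1, 1, 1, 2, 2]
stored = [store_as_bit(v) for v in ntile_groups]
```

Here stored ends up as all 1s, which matches the table shown in the question.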
Try updating your table directly rather than updating your CTE. This makes it clearer what your UPDATE statement does.
Here is an example:
WITH CTE_Split AS
(
SELECT
*,
ntile(2) over (order by newid()) as Split
FROM TableA
)
UPDATE a
SET NewColumn = c.Split
FROM
TableA a
INNER JOIN CTE_Split c ON a.IDNumber = c.IDNumber
I assume that what you want to achieve is to group your records into two randomized partitions. This statement seems to do the job.

SQL join conditional either or not both?

I have 3 tables that I'm joining and 2 variables that I'm using in one of the joins.
What I'm trying to do is figure out how to join based on either of the statements but not both.
Here's the current query:
SELECT DISTINCT
WR.Id,
CAL.Id as 'CalendarId',
T.[First Of Month],
T.[Last of Month],
WR.Supervisor,
WR.cd_Manager as [Manager], --Added to search by the Manager--
WR.[Shift] as 'ShiftId'
INTO #Workers
FROM #T T
--Calendar
RIGHT JOIN [dbo].[Calendar] CAL
ON CAL.StartDate <= T.[Last of Month]
AND CAL.EndDate >= T.[First of Month]
--Workers
--This is the problem join
RIGHT JOIN [dbo].[Worker_Filtered]WR
ON WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN(@Supervisors))
or (WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN(@Supervisors))
AND WR.cd_Manager IN(SELECT Id FROM [dbo].[User] WHERE FullName IN(@Manager))) --Added to search by the Manager--
AND WR.[Type] = '333E7907-EB80-4021-8CDB-5380F0EC89FF' --internal
WHERE CAL.Id = WR.Calendar
AND WR.[Shift] IS NOT NULL
What I want to do is either have the result based on the Worker_Filtered table matching the @Supervisor or (but not both) have it matching both the @Supervisor and @Manager.
The way it is now if it matches either condition it will be returned. This should be limiting the returned results to Workers that have both the Supervisor and Manager which would be a smaller data set than if they only match the Supervisor.
UPDATE
The query that I have above is part of a greater whole that pulls data for a supervisor's workers.
I want to also limit it to managers that are under a particular supervisor.
For example, if @Supervisor = John Doe and @Manager = Jane Doe and John has 9 workers, 8 of which are under Jane's management, then I would expect the end result to show that there are only 8 workers for each month. With the current query, it is still showing all 9 for each month.
If I change part of the RIGHT JOIN to:
WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Supervisors))
AND WR.cd_Manager IN(SELECT Id FROM [dbo].[User] WHERE FullName IN(@Manager))
Then it just returns 12 rows of NULL.
UPDATE 2
Sorry, this has taken so long to get a sample up. I could not get SQL Fiddle to work for SQL Server 2008/2014 so I am using rextester instead:
Sample
This shows the results as 108 lines. But what I want to show is just the first 96 lines.
UPDATE 3
I have made a slight update to the Sample. This does get the results that I want. I can set @Manager to NULL and it will pull all 108 records, or I can have the correct Manager name in there and it'll only pull those that match both Supervisor and Manager.
However, I'm doing this with an IF ELSE and I was hoping to avoid doing that as it duplicates code for the insert into the Worker table.
The description of expected results in update 3 makes it all clear now, thanks. Your 'problem' join needs to be:
RIGHT JOIN Worker_Filtered wr on (wr.Supervisor in(@Supervisors)
    and case when @Manager is null then 1
             else case when wr.Manager in(@Manager) then 1 else 0 end
        end = 1)
By the way, I don't know what you are expecting the in(@Supervisors) to achieve, but if you're hoping to supply a comma separated list of supervisors as a single string and have wr.Supervisor match any one of them then you're going to be disappointed. This query works exactly the same if you have = @Supervisors instead.
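The "manager filter is optional" pattern is easy to try out in isolation. Below is a small Python sketch against an in-memory SQLite database; the worker table, its columns and the helper names are invented for the example, and the filter is written as "parameter IS NULL OR column = parameter", which is the same logic as the CASE expression above in a shorter form.

```python
import sqlite3

def make_workers():
    """Nine workers under John Doe, eight of them managed by Jane Doe (made-up data)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE worker (id INTEGER PRIMARY KEY, "
                "supervisor TEXT, manager TEXT)")
    rows = [(i, "John Doe", "Jane Doe") for i in range(1, 9)]
    rows.append((9, "John Doe", "Other Manager"))
    con.executemany("INSERT INTO worker VALUES (?, ?, ?)", rows)
    return con

def workers_for(con, supervisor, manager=None):
    """Return worker ids for a supervisor, narrowed by manager only when one is given."""
    sql = """
        SELECT id FROM worker
        WHERE supervisor = :supervisor
          AND (:manager IS NULL OR manager = :manager)
        ORDER BY id
    """
    params = {"supervisor": supervisor, "manager": manager}
    return [row[0] for row in con.execute(sql, params)]
```

Passing no manager returns all nine workers, while passing "Jane Doe" returns only her eight, without an IF ELSE that duplicates the query.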

Computed column expression

I have a specific need for a computed column called ProductCode
ProductId | SellerId | ProductCode
1 1 000001
2 1 000002
3 2 000001
4 1 000003
ProductId is identity, increments by 1.
SellerId is a foreign key.
So my computed column ProductCode must look at how many products the Seller has and be in the format 000000. The problem here is how to know which Seller's products to look at.
I've written some T-SQL, but it doesn't look at how many products a seller has:
ALTER TABLE dbo.Product
ADD ProductCode AS RIGHT('000000' + CAST(ProductId AS VARCHAR(6)) , 6) PERSISTED
You cannot have a computed column based on data outside of the current row that is being updated. The best you can do to make this automatic is to create an after-trigger that queries the entire table to find the next value for the product code. But in order to make this work you'd have to use an exclusive table lock, which will utterly destroy concurrency, so it's not a good idea.
I also don't recommend using a view because it would have to calculate the ProductCode every time you read the table. This would be a huge performance-killer as well. By not saving the value in the database never to be touched again, your product codes would be subject to spurious changes (as in the case of perhaps deleting an erroneously-entered and never-used product).
Here's what I recommend instead. Create a new table:
dbo.SellerProductCode
SellerID LastProductCode
-------- ---------------
1 3
2 1
This table reliably records the last-used product code for each seller. On INSERT to your Product table, a trigger will update the LastProductCode in this table appropriately for all affected SellerIDs, and then update all the newly-inserted rows in the Product table with appropriate values. It might look something like the below.
See this trigger working in a Sql Fiddle
CREATE TRIGGER TR_Product_I ON dbo.Product FOR INSERT
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE @LastProductCode TABLE (
    SellerID int NOT NULL PRIMARY KEY CLUSTERED,
    LastProductCode int NOT NULL
);
WITH ItemCounts AS (
    SELECT
        I.SellerID,
        ItemCount = Count(*)
    FROM
        Inserted I
    GROUP BY
        I.SellerID
)
MERGE dbo.SellerProductCode C
USING ItemCounts I
    ON C.SellerID = I.SellerID
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SellerID, LastProductCode)
    VALUES (I.SellerID, I.ItemCount)
WHEN MATCHED THEN
    UPDATE SET C.LastProductCode = C.LastProductCode + I.ItemCount
OUTPUT
    Inserted.SellerID,
    Inserted.LastProductCode
INTO @LastProductCode;
WITH P AS (
    SELECT
        NewProductCode =
            L.LastProductCode + 1
            - Row_Number() OVER (PARTITION BY I.SellerID ORDER BY P.ProductID DESC),
        P.*
    FROM
        Inserted I
        INNER JOIN dbo.Product P
            ON I.ProductID = P.ProductID
        INNER JOIN @LastProductCode L
            ON P.SellerID = L.SellerID
)
UPDATE P
SET P.ProductCode = Right('00000' + Convert(varchar(6), P.NewProductCode), 6);
Note that this trigger works even if multiple rows are inserted. There is no need to preload the SellerProductCode table, either--new sellers will automatically be added. This will handle concurrency with few problems. If concurrency problems are encountered, proper locking hints can be added without deleterious effect as the table will remain very small and ROWLOCK can be used (except for the INSERT which will require a range lock).
Please do see the Sql Fiddle for working, tested code demonstrating the technique. Now you have real product codes that have no reason to ever change and will be reliable.
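The bookkeeping the trigger does can be summarised in a few lines of Python. This is only a model of the idea (one "last used code" per seller, handed out once and never recomputed); the class name is made up, and it ignores concurrency, which the database version has to handle.

```python
class ProductCodeAllocator:
    """Keep the last-used product code per seller, like the SellerProductCode table."""

    def __init__(self):
        self.last_code = {}  # seller id -> last code handed out

    def next_code(self, seller_id):
        """Reserve and return the next zero-padded six-digit code for a seller."""
        code = self.last_code.get(seller_id, 0) + 1
        self.last_code[seller_id] = code
        return f"{code:06d}"
```

Calling next_code for sellers 1, 1, 2, 1 in that order yields 000001, 000002, 000001, 000003, the same codes as in the question's table. Because a code is stored at insert time, deleting a product later does not renumber the others.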
I would normally recommend using a view to do this type of calculation. The view could even be indexed if select performance is the most important factor (I see you're using persisted).
You cannot have a subquery in a computed column, which essentially means that you can only access the data in the current row. The only ways to get this count would be to use a user-defined function in your computed column, or triggers to update a non-computed column.
A view might look like the following:
create view ProductCodes as
select p.ProductId, p.SellerId,
(
select right('000000' + cast(count(*) as varchar(6)), 6)
from Product
where SellerID = p.SellerID
and ProductID <= p.ProductID
) as ProductCode
from Product p
One big caveat to your product numbering scheme, and a downfall for both the view and UDF options, is that we're relying upon a count of rows with a lower ProductId. This means that if a Product is inserted in the middle of the sequence, it would actually change the ProductCodes of existing Products with a higher ProductId. At that point, you must either:
Guarantee the sequencing of ProductId (identity alone does not do this)
Rely upon a different column that has a guaranteed sequence (still dubious, but maybe CreateDate?)
Use a trigger to get a count at insert which is then never changed.

Resources