I have a MS SQL Server database for a restaurant. There is a transactional table that shows the meals ordered and, where applicable, the optional (choice) items ordered with each meal e.g. "with chips", "with beans".
I am trying to determine, for each choice item, what the main meal was that it was ordered with.
How can I write a query in such a way that it doesn't matter how many choice items there are for a particular main meal, it will always find the appropriate main meal? We can assume that each choice item belongs to the same main meal until a new main meal appears.
The expected outcome might look something like this (first and last columns created by the query, middle columns are present in the table): -
RowWithinTxn  TransactionID  LineNumber  CourseCategory  ItemName    AssociatedMainMeal
------------  -------------  ----------  --------------  ----------  ------------------
1             123            123456      Main            Steak       NULL
2             123            123457      Choice          Chips       Steak
3             123            123458      Choice          Beans       Steak
1             124            124567      Main            Fish        NULL
2             124            124568      Choice          Mushy Peas  Fish
As it's possible to have more than one choice item with the same main meal, something like LAST_VALUE() or LAG() (lagging by 1) won't always give the correct answer. For example:
CASE WHEN CourseCategory = 'Choice Items' THEN
LAST_VALUE(ItemName)
OVER(PARTITION BY TransactionID
ORDER BY LineNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
ELSE NULL END AS AssociatedMainMeal
OR
CASE WHEN CourseCategory = 'Choice Items' THEN
LAG(ItemName,1,0)
OVER(PARTITION BY TransactionID
ORDER BY LineNumber)
ELSE NULL END AS AssociatedMainMeal
I suspect ROW_NUMBER() might be useful somewhere, i.e.: ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY LineNumber) AS RowWithinTxn, as this would chop the table up into separate TransactionIDs
Many Thanks!
If there is only ever one main meal per transaction, you can use a windowed conditional aggregate here:
MAX(CASE CourseCategory WHEN 'Main' THEN ItemName END) OVER (PARTITION BY TransactionID)
Assuming that you could have many main meals per transaction, you effectively need a LAG function that can ignore certain rows. There is an IGNORE NULLS option in some RDBMS, where you could use something like:
LAG(CASE WHEN CourseCategory = 'Main' THEN ItemName END) IGNORE NULLS
OVER(PARTITION BY TransactionID ORDER BY LineNumber)
Unfortunately SQL Server did not support this before SQL Server 2022, but Itzik Ben-Gan came up with a nice workaround using windowed functions. Essentially, if you combine your sorting column and your data column into a single binary value, you can use MAX in a similar way to LAG: because the sorting column forms the leading bytes, the maximum binary value seen so far always comes from the most recent row that has a value. Choice rows contribute NULL (concatenating with NULL yields NULL), so MAX skips them.
It does however make for quite painful reading because you have to convert two columns to binary and combine them:
CONVERT(BINARY(4), t.LineNumber) +
CONVERT(VARBINARY(50), CASE WHEN t.CourseCategory = 'Main' THEN t.ItemName END)
Then get the max of them
MAX(<result of above expression>) OVER(PARTITION BY t.TransactionID ORDER BY t.LineNumber)
Then remove the first 4 bytes (from LineNumber) and convert back to varchar:
CONVERT(VARCHAR(50), SUBSTRING(<result of above expression>, 5, 50))
Then finally, you only want to show this value for non main meals
CASE WHEN t.CourseCategory <> 'Main' THEN <result of above expression> END
Putting this all together, along with some sample data you end up with:
DROP TABLE IF EXISTS #T;
CREATE TABLE #T
(
RowWithinTxn INT,
TransactionID INT,
LineNumber INT,
CourseCategory VARCHAR(6),
ItemName VARCHAR(10)
);
INSERT INTO #T
(
RowWithinTxn,
TransactionID,
LineNumber,
CourseCategory,
ItemName
)
VALUES
(1, 123, 123456, 'Main', 'Steak'),
(2, 123, 123457, 'Choice', 'Chips'),
(3, 123, 123458, 'Choice', 'Beans'),
(1, 124, 124567, 'Main', 'Fish'),
(2, 124, 124568, 'Choice', 'Mushy Peas');
SELECT t.RowWithinTxn,
       t.TransactionID,
       t.LineNumber,
       t.CourseCategory,
       t.ItemName,
       -- VARBINARY rather than BINARY(50) avoids trailing 0x00 padding in the converted name
       AssociatedMainMeal = CASE WHEN t.CourseCategory <> 'Main' THEN
           CONVERT(VARCHAR(50),
               SUBSTRING(
                   MAX(CONVERT(BINARY(4), t.LineNumber)
                       + CONVERT(VARBINARY(50), CASE WHEN t.CourseCategory = 'Main' THEN t.ItemName END))
                       OVER(PARTITION BY t.TransactionID ORDER BY t.LineNumber),
                   5, 50))
       END
FROM #T AS t;
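As an alternative to the binary trick, here is a minimal sketch of a "running count" approach, assuming the #T sample table above. It is a different technique from the one described in this answer: each Main row starts a new group, and every Choice row inherits the group number of the Main row before it. The names MealGroup and Grouped are only illustrative.

```sql
WITH Grouped AS
(
    SELECT t.*,
           -- Running count of Main rows so far within the transaction.
           -- Choice rows are not counted, so they keep the number of the preceding Main.
           MealGroup = COUNT(CASE WHEN t.CourseCategory = 'Main' THEN 1 END)
                       OVER (PARTITION BY t.TransactionID
                             ORDER BY t.LineNumber
                             ROWS UNBOUNDED PRECEDING)
    FROM #T AS t
)
SELECT g.RowWithinTxn,
       g.TransactionID,
       g.LineNumber,
       g.CourseCategory,
       g.ItemName,
       -- Each group holds at most one Main row, so MAX simply reads its name back.
       AssociatedMainMeal = CASE WHEN g.CourseCategory <> 'Main' THEN
           MAX(CASE WHEN g.CourseCategory = 'Main' THEN g.ItemName END)
           OVER (PARTITION BY g.TransactionID, g.MealGroup)
       END
FROM Grouped AS g
ORDER BY g.TransactionID, g.LineNumber;
```

A Choice row that appears before any Main row in its transaction lands in group 0 and gets NULL, which seems a reasonable outcome for that edge case.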
columns are: Name, Location_Name, Location_ID
I want to check Names and Location_ID, and if there are two that are the same I want to delete/remove that row.
For example: If Name John Fox at location id 4 shows up two or more times I want to just keep one.
After that, I want to count how many people per location.
Location_Name1: 45
Location_Name2: 66
Etc...
The location name and Location Id are related.
Sample data
Code I tried
Deleting duplicates is a common pattern. You apply a sequence number to all of the duplicates, then delete any that aren't first. In this case I order arbitrarily but you can choose to keep the one with the lowest PK value or that was modified last or whatever - just update the ORDER BY to sort the one you want to keep first.
;WITH cte AS
(
    SELECT *, rn = ROW_NUMBER() OVER
        (PARTITION BY Name, Location_ID ORDER BY @@SPID)
    FROM dbo.TableName
)
DELETE cte WHERE rn > 1;
Then to count, assuming there can't be two different Location_IDs for a given Location_Name (this is why schema + sample data is so helpful):
SELECT Location_Name, People = COUNT(Name)
FROM dbo.TableName
GROUP BY Location_Name;
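If you only need the counts and would rather not delete anything yet, a minimal sketch is to count distinct names instead. This assumes the same Name at the same location always means the same person, and that Location_Name and Location_ID map one-to-one:

```sql
-- Counts each Name once per location, so duplicate rows do not inflate the total.
SELECT Location_Name, People = COUNT(DISTINCT Name)
FROM dbo.TableName
GROUP BY Location_Name;
```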
If Location_Name and Location_ID are not tightly coupled (e.g. there could be Location_ID = 4, Location_Name = Place 1 and Location_ID = 4, Location_Name = Place 2), then you're going to have to define how to determine which place to display if you group by Location_ID, or admit that perhaps one of those columns is meaningless.
If Location_Name and Location_ID are tightly coupled, they shouldn't both be stored in this table. You should have a lookup/dimension table that stores both of those columns (once!) and you use the smallest data type as the key you record in the fact table (where it is repeated over and over again). This has several benefits:
Scanning the bigger table is faster, because it's not as wide
Storage is reduced because you're not repeating long strings over and over and over again
Aggregation is clearer and you can join to get names after aggregation, which will be faster
If you need to change a location's name, you only need to change it in exactly one place
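A rough sketch of that layout, with hypothetical table names (dbo.Locations and dbo.People are not from the question), might look like this:

```sql
-- Lookup/dimension table: each location is stored exactly once.
CREATE TABLE dbo.Locations
(
    Location_ID   INT         NOT NULL PRIMARY KEY,
    Location_Name VARCHAR(30) NOT NULL
);

-- Fact table: only the small key is repeated.
CREATE TABLE dbo.People
(
    Name        VARCHAR(30) NOT NULL,
    Location_ID INT         NOT NULL
        REFERENCES dbo.Locations (Location_ID)
);

-- Aggregate on the narrow key first, then join once per location to get the name.
SELECT l.Location_Name, c.People
FROM
(
    SELECT Location_ID, People = COUNT(*)
    FROM dbo.People
    GROUP BY Location_ID
) AS c
JOIN dbo.Locations AS l
  ON l.Location_ID = c.Location_ID;
```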
Sample code
CREATE TABLE People_Location
(
Name VARCHAR(30) NOT NULL,
Location_Name VARCHAR(30) NOT NULL,
Location_ID INT NOT NULL,
)
INSERT INTO People_Location
VALUES
('John Fox', 'Moon', 4),
('John Bear', 'Moon', 4),
('Peter', 'Saturn', 5),
('John Fox', 'Moon', 4),
('Micheal', 'Sun', 1),
('Jackie', 'Sun', 1),
('Tito', 'Sun', 1),
('Peter', 'Saturn', 5)
Get location and count
select Location_Name, count(1)
from
(select Name, Location_Name,
rn = ROW_NUMBER() OVER (PARTITION BY Name, Location_ID ORDER BY Name)
from People_Location
) t
where rn = 1
group by Location_Name
Result
Moon 2
Saturn 1
Sun 3
I am currently working on a table with approx. 7.5 million rows and 16 columns. One of the columns is an internal identifier (let's call it ID) we use at my university. Another column contains a string.
So, ID is NOT the unique index for a row, so it is possible that one identifier appears more than once in the table - the only difference between the two rows being the string.
I need to find all rows that share an ID, keep only the one with the longest string, and delete every other row from the original table. Unfortunately I am more of a SQL novice, and I am really stuck at this point. So if anyone could help, this would be really nice.
Take a look at this sample:
SELECT * INTO #sample FROM (VALUES
(1, 'A'),
(1,'Long A'),
(2,'B'),
(2,'Long B'),
(2,'BB')
) T(ID,Txt)
DELETE S FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LEN(Txt) DESC) RN
FROM #sample) S
WHERE RN!=1
SELECT * FROM #sample
Results:
ID Txt
-- ------
1 Long A
2 Long B
It might be possible just in SQL, but the way I know how to do it would be a two-pass approach using application code - I assume you have an application you are writing.
The first pass would be something like:
SELECT theid, COUNT(*) AS num, MAX(LEN(thestring)) AS keepme FROM thetable GROUP BY theid HAVING COUNT(*) > 1
Then you'd loop through the results in whatever language you're using and, for each ID, delete every row whose string is shorter than the maximum length returned. The language I know is PHP, so I'll use it for my example, but the method would be the same in any language (for brevity, I'm skipping error checking, prepared statements, and such, and not testing - please use carefully):
$sql = 'SELECT theid, COUNT(*) AS num, MAX(LEN(thestring)) AS keepme FROM thetable GROUP BY theid HAVING COUNT(*) > 1';
$result = sqlsrv_query($resource, $sql);
while ($row = sqlsrv_fetch_object($result)) {
    // keepme is the length of the longest string for this ID; delete every shorter row
    $delete = 'DELETE FROM thetable WHERE theid = '.$row->theid.' AND LEN(thestring) < '.$row->keepme;
    // use a separate statement so the outer result set is not overwritten
    sqlsrv_query($resource, $delete);
}
You didn't say what you would want to do if two strings are the same length, so this solution does not deal with that at all - I'm assuming that each ID will only have one longest string.
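If ties on length are possible, one way to make the outcome deterministic is to add a tie-breaker to the ORDER BY. The sketch below assumes the #sample table from the earlier answer and arbitrarily prefers the alphabetically first string among equally long ones; it only previews the rows that would be removed:

```sql
-- Rows with RN > 1 are the ones a matching DELETE would remove.
SELECT S.ID, S.Txt
FROM
(
    SELECT *,
           RN = ROW_NUMBER() OVER (PARTITION BY ID
                                   ORDER BY LEN(Txt) DESC, Txt)
    FROM #sample
) AS S
WHERE S.RN > 1;
```

Once the preview looks right, replacing "SELECT S.ID, S.Txt" with "DELETE S" should turn it into the delete, in the same way as the earlier answer.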
I'm trying to solve an issue and have ended up on the rocks...
Data looks like this:
ID, Product_Code, Amount
123, Sandwich_Ham, 2.50
234, Sandwich_Egg, 2.75
123, Sandwich_Mustard, 0.50
123, Coffee, 1.50
456, Sandwich_Beef, 3.75
234, Water, 1.25
456, Sandwich_Horseradish, 0.75
456, Beer, 4.75
Data to be returned to a Crystal Report.
ID, Product_Code, Amount
123, Sandwich_Ham, 3.00
123, Coffee, 1.50
234, Sandwich_Egg, 2.75
234, Water, 1.25
456, Sandwich_Beef, 4.50
456, Beer, 4.75
As you can see in the example above, I need to add together (sum) the Amount column for products that match on the beginning of the Product_Code field and return just one row with the total amount in the Amount column (instead of the two). All other rows pass through as-is.
For a given ID, there are only two rows that would have matching prefix (in effect only one condiment allowed per sandwich), max one sandwich per person.
The Crystal report filters by ID and has to display every line for a given ID, so no way I can just sum the values in CR. (I've tried.)
I created a scalar-valued function that returns the total amount when passed an ID, but can't figure out how to get that to work for an entire table. Based on my reading, a table-valued function won't help.
My thought is I could write a stored procedure to loop through the table, call the scalar function to handle the case where a product code begins with 'Sandwich_' and write everything out to a temp table. I would use a view to call the stored procedure and return the rows from the temp table. Then modify the report in CR to use the view. Performance could be a big issue with this, plus it seems rather complex for dealing with a 'small change' to an existing report.
Any suggestions on other methods for solving this (before I go down another rat hole...)?
Use LEFT to get the grouping and then do a simple SUM and GROUP BY:
WITH Cte AS(
    SELECT *,
        grp = CASE
                WHEN CHARINDEX('_', Product_Code) = 0 THEN Product_Code
                ELSE LEFT(Product_Code, CHARINDEX('_', Product_Code) - 1)
              END
    FROM #Tbl
)
SELECT
    ID,
    -- MIN returns Sandwich_Ham / Sandwich_Beef for the sample data;
    -- note it picks alphabetically, not by row order
    Product_Code = MIN(Product_Code),
    Amount = SUM(Amount)
FROM Cte
GROUP BY
    ID, grp
ORDER BY ID;
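Picking the Product_Code with an aggregate depends on how the codes happen to sort. A hedged variation, assuming the main item always costs more than its condiment (an assumption, not something the question states), keeps the code of the most expensive row in each group:

```sql
WITH Cte AS
(
    SELECT ID, Product_Code, Amount,
           grp = CASE
                   WHEN CHARINDEX('_', Product_Code) = 0 THEN Product_Code
                   ELSE LEFT(Product_Code, CHARINDEX('_', Product_Code) - 1)
                 END
    FROM #Tbl
),
Ranked AS
(
    SELECT ID, Product_Code,
           -- Total for the whole prefix group, repeated on every row of the group.
           TotalAmount = SUM(Amount) OVER (PARTITION BY ID, grp),
           -- rn = 1 marks the most expensive row; Product_Code breaks ties.
           rn = ROW_NUMBER() OVER (PARTITION BY ID, grp
                                   ORDER BY Amount DESC, Product_Code)
    FROM Cte
)
SELECT ID, Product_Code, Amount = TotalAmount
FROM Ranked
WHERE rn = 1
ORDER BY ID;
```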
Try this as the SELECT in CR:
Select id, case when charIndex('_', product_Code) > 0
                then left(product_Code, charIndex('_', product_Code))
                else product_Code end ProductCategory,
       sum(Amount) totalAmt
From myTable
Group By id, case when charIndex('_', product_Code) > 0
                  then left(product_Code, charIndex('_', product_Code))
                  else product_Code end
Once again I really need your expertise. It looks like I am not ranking well here...
I have two records returned and they are both ranked 1 even though their date/time differs by some minutes. Thanks guys, I really appreciate your help as always.
SELECT CD.MEMACT,
CD.DATETIME,--DATETIME
CD.AG_ID,
RANK() OVER (PARTITION BY
CD.MEMACT,
CD.DATETIME,
CD.AG_ID
ORDER BY CD.DATETIME)RANKED
FROM MEM_ACT_TBL CD
WHERE CD.MEMACT='1024518'
You provided this sample data as violating your expectations:
MEMACT DATETIME AG_ID RANK
------- --------------- ------- ----
1024518 12/26/2013 7:43 Ag_1541 1
1024518 12/26/2013 7:53 Ag_2488 1
But in your example code, you PARTITION BY CD.MEMACT, CD.DATETIME, CD.AG_ID. This explicitly means to treat all rows that do not share these values as completely separate ranking series. Since the two rows you presented above do not have the same MEMACT, DATETIME, and AG_ID values, then they each start their ranking from 1.
If you were expecting the two rows above to ascend in rank, then you must not partition by anything but MEMACT, since only that column is the same between the rows. You would then most likely order by DATETIME and then AG_ID (to get determinism if you have two rows with the same date) like so:
PARTITION BY CD.MEMACT ORDER BY CD.DATETIME, CD.AG_ID
It never makes sense in any of the ranking functions to both PARTITION BY and ORDER BY the same column as this will ensure that the ORDER BY has no differing data to work with as you've already segregated all the separate series by the same column.
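To see the difference side by side, here is a small self-contained sketch using the two sample rows above. The table variable @MemAct is only a stand-in for the real MEM_ACT_TBL, and the column types are guesses:

```sql
DECLARE @MemAct TABLE
(
    MEMACT     VARCHAR(10),
    [DATETIME] DATETIME,
    AG_ID      VARCHAR(10)
);

INSERT INTO @MemAct VALUES
    ('1024518', '2013-12-26T07:43:00', 'Ag_1541'),
    ('1024518', '2013-12-26T07:53:00', 'Ag_2488');

SELECT CD.MEMACT,
       CD.[DATETIME],
       CD.AG_ID,
       -- Every row is alone in its partition, so every row ranks 1.
       RankPartitionedByAll = RANK() OVER (PARTITION BY CD.MEMACT, CD.[DATETIME], CD.AG_ID
                                           ORDER BY CD.[DATETIME]),
       -- Both rows share one partition, so they rank 1 and 2.
       RankPartitionedByMemAct = RANK() OVER (PARTITION BY CD.MEMACT
                                              ORDER BY CD.[DATETIME], CD.AG_ID)
FROM @MemAct AS CD;
```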
Without seeing your data I could say it is due to the partitioning method. Your 'partitioning' statement is essentially WHAT you group by to determine number position resetting. The 'order by' determines the sequence of HOW you are ordering that data. See this simple example where in my first windowed function I use one too many partition by's and thus just show a whole bunch of redundant ones.
Simple self extracting example:
declare @Person table ( personID int identity, person varchar(8));
insert into @Person values ('Brett'),('Sean'),('Chad'),('Michael'),('Ray'),('Erik'),('Queyn');
declare @Orders table ( OrderID int identity, PersonID int, Desciption varchar(32), Amount int);
insert into @Orders values (1, 'Shirt', 20),(1, 'Shoes', 50),(2, 'Shirt', 22),(2, 'Shoes', 52),(3, 'Shirt', 20),(3, 'Shoes', 50),(3, 'Hat', 20),(4, 'Shirt', 20),(5, 'Shirt', 20),(5, 'Pants', 30),
(6, 'Shirt', 20),(6, 'RunningShoes', 70),(7, 'Shirt', 22),(7, 'Shoes', 40),(7, 'Coat', 80);
Select
    p.person
,   o.Desciption
,   o.Amount
,   row_number() over(partition by person, Desciption order by Desciption) as [Wrong I Used Too Many Partitions]
,   row_number() over(partition by person order by Desciption) as [Correct By Alpha of Description]
,   row_number() over(partition by person order by Amount desc) as [Correct By Amount of Highest First]
from @Person p
join @Orders o on p.personID = o.PersonID
I would guess you could do this instead:
SELECT
CD.MEMACT,
CD.DATETIME,--DATETIME
CD.AG_ID,
RANK() OVER (PARTITION BY CD.MEMACT ORDER BY CD.DATETIME, CD.AG_ID) RANKED
FROM MEM_ACT_TBL CD
WHERE CD.MEMACT='1024518'
I want to select all records, but have the query only return a single record per Product Name. My table looks similar to:
SellId ProductName Comment
1 Cake dasd
2 Cake dasdasd
3 Bread dasdasdd
where the Product Name is not unique. I want the query to return a single record per ProductName with results like:
SellId ProductName Comment
1 Cake dasd
3 Bread dasdasdd
I have tried this query,
Select distinct ProductName, Comment, SellId from TBL#Sells
but it is returning multiple records with the same ProductName. My table is not really as simple as this, this is just a sample. What is the solution? Is it clear?
Select ProductName,
min(Comment) , min(SellId) from TBL#Sells
group by ProductName
If you only want one record per ProductName, you of course have to choose which value you want for the other fields.
If you aggregate (using GROUP BY) you can choose an aggregate function,
that is, a function that takes a list of values and returns only one. Here I have chosen MIN: the smallest value for each field.
NOTE: Comment and SellId can come from different records, since MIN is taken per column.
Other aggregates you might find useful:
FIRST : first record encountered (available in some databases such as MS Access, not in SQL Server)
LAST : last record encountered (same caveat as FIRST)
AVG : average
COUNT : number of records
FIRST/LAST have the advantage that all fields come from the same record.
SELECT S.ProductName, S.Comment, S.SellId
FROM
Sells S
JOIN (SELECT MAX(SellId) AS SellId
FROM Sells
GROUP BY ProductName) AS TopSell ON TopSell.SellId = S.SellId
This will get the latest comment as your selected comment assuming that SellId is an auto-incremented identity that goes up.
I know, you've got an answer already, I'd like to offer a way that was fastest in terms of performance for me, in a similar situation. I'm assuming that SellId is Primary Key and identity. You'd want an index on ProductName for best performance.
select
Sells.*
from
(
select
distinct ProductName
from
Sells
) x
join
Sells
on
Sells.ProductName = x.ProductName
and Sells.SellId =
(
select
top 1 s2.SellId
from
Sells s2
where
x.ProductName = s2.ProductName
Order By SellId
)
A slower method, (but still better than Group By and MIN on a long char column) is this:
select
*
from
(
select
*,ROW_NUMBER() over (PARTITION BY ProductName order by SellId) OccurenceId
from sells
) x
where
OccurenceId = 1
An advantage of this one is that it's much easier to read.
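The index on ProductName mentioned at the start of this answer could be sketched like this. The index name is made up, and adding SellId as a second key column is my own assumption, so that the "first SellId per ProductName" lookup can be answered from the index:

```sql
-- Hypothetical index name; adjust the schema/table name to match your own.
CREATE INDEX IX_Sells_ProductName_SellId
    ON Sells (ProductName, SellId);
```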
create table Sale
(
SaleId int not null
constraint PK_Sale primary key,
ProductName varchar(100) not null,
Comment varchar(100) not null
)
insert Sale
values
(1, 'Cake', 'dasd'),
(2, 'Cake', 'dasdasd'),
(3, 'Bread', 'dasdasdd')
-- Option #1 with over()
select *
from Sale
where SaleId in
(
select SaleId
from
(
select SaleId, row_number() over(partition by ProductName order by SaleId) RowNumber
from Sale
) tt
where RowNumber = 1
)
order by SaleId
-- Option #2
select *
from Sale
where SaleId in
(
select min(SaleId)
from Sale
group by ProductName
)
order by SaleId
drop table Sale