Split string of variable length, variable delimiters - sql-server

I've read the question posed here, but mine is a little more complicated.
I have a string that is variable in length, and the delimiter can sometimes be two dashes, or sometimes it can be just one. Let's say in my table the data that I want to break out is stored in a single column like this:
+------------------------------------------+
| Category                                 |
+------------------------------------------+
| Zoo - Animals - Lions                    |
| Zoo - Personnel                          |
| Zoo - Operating Costs - Power / Cooling  |
+------------------------------------------+
But I want to output the data string from that single column into three separate columns like this:
+----------+-----------------+-----------------+
| Location | Category        | Sub-Category    |
+----------+-----------------+-----------------+
| Zoo      | Animals         | Lions           |
| Zoo      | Personnel       |                 |
| Zoo      | Operating Costs | Power / Cooling |
+----------+-----------------+-----------------+
Hoping for some guidance as the samples I've been finding on Google seem to be simpler than this.

You can also use a string splitter. Here is an excellent one that works with your version: DelimitedSplit8K.
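The function returns one row per fragment together with its ordinal position. To illustrate (assuming the canonical DelimitedSplit8K signature with ItemNumber and Item columns, as used in the query below):
SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('Zoo - Animals - Lions', '-');
-- ItemNumber | Item
-- 1          | 'Zoo '
-- 2          | ' Animals '
-- 3          | ' Lions'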
Now we need some sample data.
declare @Something table
(
Category varchar(100)
)
insert @Something values
('Zoo - Animals - Lions')
, ('Zoo - Personnel')
, ('Zoo - Operating Costs - Power / Cooling')
Now that we have a function and sample data, the code for this is quite nice and tidy.
select s.Category
, Location = max(case when x.ItemNumber = 1 then Item end)
, Category = max(case when x.ItemNumber = 2 then Item end)
, SubCategory = max(case when x.ItemNumber = 3 then Item end)
from @Something s
cross apply dbo.DelimitedSplit8K(s.Category, '-') x
group by s.Category
And this will return:
Category                                | Location | Category        | SubCategory
Zoo - Animals - Lions                   | Zoo      | Animals         | Lions
Zoo - Operating Costs - Power / Cooling | Zoo      | Operating Costs | Power / Cooling
Zoo - Personnel                         | Zoo      | Personnel       | NULL
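Note that splitting on the single character '-' leaves the surrounding spaces on each fragment. If you want them clean, a variant of the same query that wraps each Item in LTRIM(RTRIM(...)) would be:
select s.Category
, Location = max(case when x.ItemNumber = 1 then ltrim(rtrim(x.Item)) end)
, Category = max(case when x.ItemNumber = 2 then ltrim(rtrim(x.Item)) end)
, SubCategory = max(case when x.ItemNumber = 3 then ltrim(rtrim(x.Item)) end)
from @Something s
cross apply dbo.DelimitedSplit8K(s.Category, '-') x
group by s.Category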

Here is a solution that uses solely string functions:
select
left(
category,
charindex('-', category) - 2
) location,
substring(
category,
charindex('-', category) + 2,
case when charindex('-', category, charindex('-', category) + 1) > 0
then charindex('-', category, charindex('-', category) + 1) - charindex('-', category) - 3
else len(category) end
) category,
case when charindex('-', category, charindex('-', category) + 1) > 0
then right(category, charindex('-', reverse(category)) - 2)
end sub_category
from t
Demo on DB Fiddle:
location | category        | sub_category
:------- | :-------------- | :--------------
Zoo      | Animals         | Lions
Zoo      | Personnel       | null
Zoo      | Operating Costs | Power / Cooling

You've tagged this with [sql-server-2017]. That means you can use JSON support (introduced with v2016).
Currently, JSON is the best built-in approach for position- and type-safe string splitting:
A mockup, to simulate your issue
DECLARE @mockup TABLE (ID INT IDENTITY, Category VARCHAR(MAX))
INSERT INTO @mockup (Category)
VALUES ('Zoo - Animals - Lions')
,('Zoo - Personnel')
,('Zoo - Operating Costs - Power / Cooling');
--The query
SELECT t.ID
,A.[Location]
,A.Category
,A.subCategory
FROM @mockup t
CROSS APPLY OPENJSON(CONCAT('[["',REPLACE(t.Category,'-','","'),'"]]'))
WITH ([Location] VARCHAR(MAX) '$[0]'
,Category VARCHAR(MAX) '$[1]'
,SubCategory VARCHAR(MAX) '$[2]') A;
The result (might need some TRIM()ing)
ID Location Category subCategory
1 Zoo Animals Lions
2 Zoo Personnel NULL
3 Zoo Operating Costs Power / Cooling
The idea in short:
We use some simple string operations to transform your string into a JSON array:
a - b - c => [["a "," b "," c"]]
Now we can use OPENJSON() together with a WITH-clause to return each fragment by its position with a fixed type.
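If you want the TRIM()ing mentioned above built in (the delimiter is the single character '-', so each fragment keeps its padding), a sketch on your version would be:
SELECT t.ID
,TRIM(A.[Location]) AS [Location]
,TRIM(A.Category) AS Category
,TRIM(A.SubCategory) AS SubCategory
FROM @mockup t
CROSS APPLY OPENJSON(CONCAT('[["',REPLACE(t.Category,'-','","'),'"]]'))
WITH ([Location] VARCHAR(MAX) '$[0]'
,Category VARCHAR(MAX) '$[1]'
,SubCategory VARCHAR(MAX) '$[2]') A;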

Bit of a hack (it relies on STRING_SPLIT returning the fragments in their original order, which is not guaranteed), but it works:
DECLARE @t TABLE (Category VARCHAR(255))
INSERT @t (Category)
VALUES ('Zoo - Animals - Lions'),('Zoo - Personnel'),('Zoo - Operating Costs - Power / Cooling')
;WITH split_vals AS (
SELECT Category AS Cat,TRIM(Value) AS Value,ROW_NUMBER() OVER (PARTITION BY Category ORDER BY Category) AS RowNum
FROM @t
CROSS APPLY STRING_SPLIT(Category,'-')
), cols AS (
SELECT
Cat,
CASE WHEN RowNum = 1 THEN Value END AS Location,
CASE WHEN RowNum = 2 THEN Value END AS Category,
CASE WHEN RowNum = 3 THEN Value END AS [Sub-Category]
FROM split_vals
)
SELECT STRING_AGG(Location, '') AS Location,
STRING_AGG(Category, '') AS Category,
STRING_AGG([Sub-Category], '') AS [Sub-Category]
FROM cols
GROUP BY Cat;

Related

How can I do a custom order in Snowflake?

In Snowflake, how do I define a custom sorting order?
ID Language Text
0 ENU a
0 JPN b
0 DAN c
1 ENU d
1 JPN e
1 DAN f
2 etc...
Here I want to return all rows sorted by Language in this order: ENU comes first, then JPN and lastly DAN.
Is this even possible?
I would like to order by language in this repeating order: ENU, JPN, DAN, and so on: ENU, JPN, DAN, ENU, JPN, DAN, ENU, JPN, DAN
NOT: ENU, ENU, ENU, JPN, JPN, JPN, DAN, DAN, DAN
I liked the array_position solution from Phil Coulson. It's also possible to use DECODE:
create or replace table mydata ( ID number, Language varchar, Text varchar )
as select * from values
(0, 'JPN' , 'b'),
(0, 'DAN' , 'c' ),
(0, 'ENU' , 'a'),
(1 , 'JPN' , 'e'),
(1 , 'ENU' , 'd'),
(1 , 'DAN' , 'f');
select * from
mydata order by ID, DECODE(Language,'ENU',0,'JPN',1,'DAN',2 );
+----+----------+------+
| ID | LANGUAGE | TEXT |
+----+----------+------+
| 0 | ENU | a |
| 0 | JPN | b |
| 0 | DAN | c |
| 1 | ENU | d |
| 1 | JPN | e |
| 1 | DAN | f |
+----+----------+------+
You basically need two levels of sort. I am using arrays to arrange the languages in the order I want, and then array_position to assign every language an index by which it will be sorted. You can achieve the same using either a case expression or decode. To make sure the languages don't repeat within the same id, we use row_number; you can comment out the row_number() line if that's not a requirement.
with cte (id, lang) as
(select 0,'JPN' union all
select 0,'ENU' union all
select 0,'DAN' union all
select 0,'ENU' union all
select 0,'JPN' union all
select 0,'DAN' union all
select 1,'JPN' union all
select 1,'ENU' union all
select 1,'DAN' union all
select 1,'ENU' union all
select 1,'JPN' union all
select 1,'DAN')
select *
from cte
order by id,
row_number() over (partition by id, array_position(lang::variant,['ENU','JPN','DAN']) order by lang), --in case you want languages to not repeat within each id
array_position(lang::variant,['ENU','JPN','DAN'])
This is not to say the other answers are wrong.
But here's yet another option, using ANSI SQL CASE:
SELECT * FROM "Example"
ORDER BY
CASE "Language" WHEN 'ENU' THEN 1
WHEN 'JPN' THEN 2
WHEN 'DAN' THEN 3
ELSE 4 END
,"Language";
Notice the trailing "Language" column is used as a tiebreaker for 'other' languages not specified (they all fall into the ELSE 4 bucket).
It's good defensive programming to give every CASE an ELSE.
The most flexible answer of all is to have a table with a collation order for languages in it.
Collation-order columns are common in many applications.
I've seen them used for everything from multiple parties to a contract who should appear in a specified order, to (of course) the positional order of columns in a table of metadata.
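A minimal sketch of that idea in Snowflake (the lookup table and its names are made up for illustration):
create or replace table language_sort_order (language varchar, sort_order number);
insert into language_sort_order values ('ENU', 0), ('JPN', 1), ('DAN', 2);
select d.*
from mydata d
left join language_sort_order o on o.language = d.language
-- unlisted languages fall to the back, then sort alphabetically
order by d.id, coalesce(o.sort_order, 999), d.language;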

Extract Quantity and Price from text

I have these data
CREATE TABLE #Items (ID INT , Col VARCHAR(300))
INSERT INTO #Items VALUES
(1, 'Dave sold 10 items are sold to ABC servercies at 2.50 each'),
(2, '21 was sold to Tray Limited 3.90 each'),
(3, 'Consulting ordered 15 at 7.11 per one'),
(4, 'Returns from Murphy 7 at a cost of 6.10 for each item')
From Col I want to extract the Quantity and Price.
I have written the below query which extract the quantity
SELECT
ID,
Col,
LEFT(SUBSTRING(Col, PATINDEX('%[0-9]%', Col), LEN(Col)),2) AS Qty
FROM #Items
My difficulty is that I don't know how to extract the Price.
Expected output
You were already told that storing values within such a string is a real no-go.
But, if you have to deal with external input, you might try this:
DECLARE @items TABLE(ID INT, Col VARCHAR(300))
INSERT INTO @items VALUES
(1, 'Dave sold 10 items are sold to ABC servercies at 2.50 each'),
(2, '21 was sold to Tray Limited 3.90 each'),
(3, 'Consulting ordered 15 at 7.11 per one'),
(4, 'Returns from Murphy 7 at a cost of 6.10 for each item');
SELECT i.ID
,i.Col
,A.Casted.value('/x[not(empty(. cast as xs:int?))][1]','int') AS firstNumberAsInt
,A.Casted.value('/x[not(empty(. cast as xs:decimal?))][2]','decimal(10,4)') AS SecondNumberAsDecimal
FROM @items i
CROSS APPLY(SELECT CAST('<x>' + REPLACE((SELECT i.Col AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)) A(Casted);
The idea in short:
we use some string methods to transform your string into XML, where each word sits within its own <x> element.
We use XQuery's ability to pick only nodes that satisfy a predicate.
The predicate not(empty(. cast as someType)) returns an element only where its content can be cast to the given type; any other element is omitted.
The result:
+----+------------------------------------------------------------+------------------+-----------------------+
| ID | Col | firstNumberAsInt | SecondNumberAsDecimal |
+----+------------------------------------------------------------+------------------+-----------------------+
| 1 | Dave sold 10 items are sold to ABC servercies at 2.50 each | 10 | 2.5000 |
+----+------------------------------------------------------------+------------------+-----------------------+
| 2 | 21 was sold to Tray Limited 3.90 each | 21 | 3.9000 |
+----+------------------------------------------------------------+------------------+-----------------------+
| 3 | Consulting ordered 15 at 7.11 per one | 15 | 7.1100 |
+----+------------------------------------------------------------+------------------+-----------------------+
| 4 | Returns from Murphy 7 at a cost of 6.10 for each item | 7 | 6.1000 |
+----+------------------------------------------------------------+------------------+-----------------------+
I'm sure you know that there are millions of cases where this kind of parsing will break...
First things first: DON'T store things like that in a DB and expect to be able just "extract" data. I can give you a solution given the data you have, but it's going to fall down pretty quickly if anyone enters something silly, for example "Sold ice creams 1.50 each x 10" or "Bought 5 sorbets total 20".
What we will do is use CROSS APPLY in series to calculate the positions of each number.
SELECT
ID,
Col,
CAST(SUBSTRING(Col, FirstNum, EndFirst - 1) AS int) AS Qty,
CAST(SUBSTRING(Col, FirstNum + EndFirst + SecondNum - 2, EndSecond) AS decimal(18,2)) AS Price
FROM #Items
CROSS APPLY (VALUES (PATINDEX('%[0-9]%', Col) ) ) v1(FirstNum)
CROSS APPLY (VALUES (PATINDEX('%[^0-9]%', SUBSTRING(Col, FirstNum, LEN(Col))) ) ) v2(EndFirst)
CROSS APPLY (VALUES (PATINDEX('%[0-9.]%', SUBSTRING(Col, FirstNum + EndFirst - 1, LEN(Col))) ) ) v3(SecondNum)
CROSS APPLY (VALUES (PATINDEX('%[^0-9.]%', SUBSTRING(Col, FirstNum + EndFirst - 1 + SecondNum, LEN(Col))) ) ) v4(EndSecond)
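If the text may not always follow the "quantity first, price second" shape, a hedged safety net is to swap CAST for TRY_CAST, so a row whose fragments don't convert yields NULL instead of a conversion error (this guards only the conversion; the PATINDEX positions are unchanged):
SELECT
ID,
Col,
TRY_CAST(SUBSTRING(Col, FirstNum, EndFirst - 1) AS int) AS Qty,
TRY_CAST(SUBSTRING(Col, FirstNum + EndFirst + SecondNum - 2, EndSecond) AS decimal(18,2)) AS Price
FROM #Items
CROSS APPLY (VALUES (PATINDEX('%[0-9]%', Col) ) ) v1(FirstNum)
CROSS APPLY (VALUES (PATINDEX('%[^0-9]%', SUBSTRING(Col, FirstNum, LEN(Col))) ) ) v2(EndFirst)
CROSS APPLY (VALUES (PATINDEX('%[0-9.]%', SUBSTRING(Col, FirstNum + EndFirst - 1, LEN(Col))) ) ) v3(SecondNum)
CROSS APPLY (VALUES (PATINDEX('%[^0-9.]%', SUBSTRING(Col, FirstNum + EndFirst - 1 + SecondNum, LEN(Col))) ) ) v4(EndSecond)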

Find biggest order by client in Microsoft SQL Server

I am looking for an elegant solution to find the preferred channel by client.
As an input we get a list of transactions, which contains clientid, date, invoice_id, channel and amount. For every client we need to find the preferred channel based on amount.
In case a specific client ties between 2 channels, the outcome should be RANDOM among those channels.
Input data:
Clients ID | Date | Invoice Id | Channel | Amount
-----------+------------+------------+---------+--------
Client #1 | 01-01-2020 | 0000000001 | Retail | 90
Client #1 | 07-01-2020 | 0000000002 | Website | 180
Client #2 | 08-01-2020 | 0000000003 | Retail | 70
Client #2 | 09-01-2020 | 0000000004 | Website | 70
Client #3 | 10-01-2020 | 0000000005 | Retail | 140
Client #4 | 11-01-2020 | 0000000006 | Retail | 70
Client #4 | 13-01-2020 | 0000000007 | Website | 30
Desired output:
Clients ID | Top-Channel
-----------+-----------------
Client #1 | Website >> website 180 > retail 90
Client #2 | Retail >> random choice from Retail and Website
Client #3 | Retail >> retail 140 > website 0
Client #4 | Retail >> retail 70 > website 30
Usually to solve such tasks I do some manipulations with GROUP BY, add a random number which is less than 1, and many other tricks. But most probably, there is a better solution.
This is for Microsoft SQL Server
If you have the totals, then you can use window functions:
select t.*
from (select t.*,
row_number() over (partition by client_id order by amount desc) as seqnum
from t
) t
where seqnum = 1;
If you need to aggregate to get the totals, the same approach works with aggregation:
select t.*
from (select t.client_id, t.channel, sum(amount) as total_amount,
row_number() over (partition by client_id order by sum(amount) desc) as seqnum
from t
group by t.client_id, t.channel
) t
where seqnum = 1;
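Note the question asks for a RANDOM winner when two channels tie on total amount; row_number() as written picks an arbitrary, but not explicitly random, one. A hedged tweak is to add NEWID() as a tiebreaker:
select t.*
from (select t.client_id, t.channel, sum(amount) as total_amount,
row_number() over (partition by client_id order by sum(amount) desc, newid()) as seqnum
from t
group by t.client_id, t.channel
) t
where seqnum = 1;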
So, keeping your desired output in mind, I wrote the following T-SQL without using GROUP BY:
declare @clients_id int = 1
declare @clients_id_max int = (select max(clients_id) from random)
declare @tab1 table (clients_id int, [Top-Channel] nvarchar(10), amount int)
declare @tab2 table (clients_id int, remarks nvarchar(100))
while @clients_id <= @clients_id_max
begin
if ((select count(*) from random where clients_id = @clients_id) > 1)
begin
insert into @tab2 select top 1 a.clients_id, a.channel + ' ' + cast(a.amount as nvarchar(5)) + ' ; ' + b.channel + ' ' + cast(b.amount as nvarchar(5)) as remarks
from random a, random b where a.clients_id = @clients_id and a.clients_id = b.clients_id and a.channel <> b.channel
order by a.amount desc
end
else
begin
insert into @tab2 select a.clients_id, a.channel + ' ' + cast(a.amount as nvarchar(5)) as remarks
from random a, random b where a.clients_id = @clients_id and a.clients_id = b.clients_id
order by a.amount desc
end
insert into @tab1 select top 1 clients_id, Channel as [Top-Channel], amount from random where clients_id = @clients_id order by amount desc
set @clients_id = @clients_id + 1
end
select a.clients_id, a.[Top-Channel], b.Remarks from @tab1 a join @tab2 b on a.clients_id = b.clients_id
Query output: https://i.stack.imgur.com/7RzcV.jpg
This will work:
select distinct ID,
FIRST_VALUE(Channel) over (partition by ID order by amount desc, NEWID()) as TopChannel
from Table1
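Since "preferred" is defined by the summed amount per channel, a hedged variant that aggregates first and keeps the random tie-break:
select distinct ID,
FIRST_VALUE(Channel) over (partition by ID order by total desc, NEWID()) as TopChannel
from (select ID, Channel, sum(Amount) as total
from Table1
group by ID, Channel) t;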

Using STRING_SPLIT for 2 columns in a single table

I've started from a table like this
ID | City                                | Sales
1  | London,New York,Paris,Berlin,Madrid | 20,30,,50
2  | Istanbul,Tokyo,Brussels             | 4,5,6
There can be an unlimited number of cities and/or sales.
I need to get each city and its sales amount in its own record. So my result should look something like this:
ID | City | Sales
1 | London | 20
1 | New York | 30
1 | Paris |
1 | Berlin | 50
1 | Madrid |
2 | Istanbul | 4
2 | Tokyo | 5
2 | Brussels | 6
What I got so far is
SELECT ID, splitC.Value, splitS.Value
FROM Table
CROSS APPLY STRING_SPLIT(Table.City,',') splitC
CROSS APPLY STRING_SPLIT(Table.Sales,',') splitS
With one cross apply, this works perfectly. But when executing the query with a second one, it starts to multiply the number of records a lot (which makes sense I think, because it's trying to split the sales for each city again).
What would be an option to solve this issue? STRING_SPLIT is not neccesary, it's just how I started on it.
STRING_SPLIT() is not an option, because (as is mentioned in the documentation) the output rows might be in any order and the order is not guaranteed to match the order of the substrings in the input string.
But you may try a JSON-based approach, using OPENJSON() and a string transformation (the comma-separated values are turned into a valid JSON array: London,New York,Paris,Berlin,Madrid becomes ["London","New York","Paris","Berlin","Madrid"]). The result from OPENJSON() with the default schema is a table with columns key, value and type, and the key column holds the 0-based index of each item in the array:
Table:
CREATE TABLE Data (
ID int,
City varchar(1000),
Sales varchar(1000)
)
INSERT INTO Data
(ID, City, Sales)
VALUES
(1, 'London,New York,Paris,Berlin,Madrid', '20,30,,50'),
(2, 'Istanbul,Tokyo,Brussels', '4,5,6')
Statement:
SELECT d.ID, a.City, a.Sales
FROM Data d
CROSS APPLY (
SELECT c.[value] AS City, s.[value] AS Sales
FROM OPENJSON(CONCAT('["', REPLACE(d.City, ',', '","'), '"]')) c
LEFT OUTER JOIN OPENJSON(CONCAT('["', REPLACE(d.Sales, ',', '","'), '"]')) s
ON c.[key] = s.[key]
) a
Result:
ID City Sales
1 London 20
1 New York 30
1 Paris
1 Berlin 50
1 Madrid NULL
2 Istanbul 4
2 Tokyo 5
2 Brussels 6
STRING_SPLIT has no concept of ordinal positions. In fact, the documentation specifically states that it doesn't care about them:
The order of the output may vary as the order is not guaranteed to match the order of the substrings in the input string.
As a result, you need to use something that is aware of such basic things, such as DelimitedSplit8k_LEAD.
Then you can do something like this:
WITH Cities AS(
SELECT ID,
DSc.Item,
DSc.ItemNumber
FROM dbo.YourTable YT
CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.City,',') DSc),
Sales AS(
SELECT ID,
DSs.Item,
DSs.ItemNumber
FROM dbo.YourTable YT
CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.Sales,',') DSs)
SELECT ISNULL(C.ID,S.ID) AS ID,
C.Item AS City,
S.Item AS Sale
FROM Cities C
FULL OUTER JOIN Sales S ON C.ID = S.ID AND C.ItemNumber = S.ItemNumber;
Of course, the real solution is to fix your design. This type of design is only going to cause you hundreds of problems in the future. Fix it now, not later; the sooner you do it, the more rewards you'll reap.
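For illustration, a hedged sketch of a fixed design (the names are made up), with one row per city/sales pair instead of two parallel delimited strings:
CREATE TABLE CitySales (
ID int NOT NULL,        -- the original row ID
Position int NOT NULL,  -- ordinal position, formerly implied by string order
City varchar(100) NOT NULL,
Sales int NULL,         -- NULL where the old string had an empty slot
CONSTRAINT PK_CitySales PRIMARY KEY (ID, Position)
);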

T-SQL - Finding records with chronological gaps

This is my first post here. I'm still a novice SQL user at this point though I've been using it for several years now. I am trying to find a solution to the following problem and am looking for some advice, as simple as possible, please.
I have this 'recordTable' with the following columns related to transactions; 'personID', 'recordID', 'item', 'txDate' and 'daySupply'. The recordID is the primary key. Almost every personID should have many distinct recordID's with distinct txDate's.
My focus is on one particular 'item' for all of 2017. It's expected that once the item's daySupply has elapsed for a recordID, we would see a newer recordID for that person with a more recent txDate somewhere between five days before and five days after the end of the daySupply.
What I'm trying to uncover are the number of distinct recordID's where there wasn't an expected new recordID during this ten day window. I think this is probably very simple to solve but I am having a lot of difficulty trying to create a query for it, let alone explain it to someone.
My thought thus far is to create two temp tables. The first temp table stores all of the records associated with the desired items and I'm just storing the personID, recordID and txDate columns. The second temp table has the personID, recordID and the two derived columns from the txDate and daySupply; these would represent the five days before and five days after.
I am trying to find some way to determine the number of recordID's from the first table that don't have expected refills for that personID in the second. I thought a simple EXCEPT would do this, but I don't think there's any way of getting around a recursive-type statement to answer this, and I have never gotten comfortable with recursive queries.
I searched Stackoverflow and elsewhere but couldn't come up with an answer to this one. I would really appreciate some help from some more clever data folks. Here is the code so far. Thanks everyone!
CREATE TABLE #temp1 (personID VARCHAR(20), recordID VARCHAR(10), txDate DATE)
CREATE TABLE #temp2 (personID VARCHAR(20), recordID VARCHAR(10), startDate DATE, endDate DATE)
INSERT INTO #temp1
SELECT [personID], [recordID], txDate
FROM recordTable
WHERE item = 'desiredItem'
AND txDate > '12/31/16'
AND txDate < '1/1/18';
INSERT INTO #temp2
SELECT [personID], [recordID], DATEADD(day, daySupply - 5, txDate), DATEADD(day, daySupply + 5, txDate)
FROM recordTable
WHERE item = 'desiredItem'
AND txDate > '12/31/16'
AND txDate < '1/1/18';
I agree with mypetlion that you could have been more concise with your question, but I think I can figure out what you are asking.
SQL Window Functions to the rescue!
Here's the basic idea...
CREATE TABLE #fills(
personid INT,
recordid INT,
item NVARCHAR(MAX),
filldate DATE,
dayssupply INT
);
INSERT #fills
VALUES (1, 1, 'item', '1/1/2018', 30),
(1, 2, 'item', '2/1/2018', 30),
(1, 3, 'item', '3/1/2018', 30),
(1, 4, 'item', '5/1/2018', 30),
(1, 5, 'item', '6/1/2018', 30)
;
SELECT *,
ABS(
DATEDIFF(
DAY,
LAG(DATEADD(DAY, dayssupply, filldate)) OVER (PARTITION BY personid, item ORDER BY filldate),
filldate
)
) AS gap
FROM #fills
ORDER BY filldate;
... outputs ...
+----------+----------+------+------------+------------+------+
| personid | recordid | item | filldate | dayssupply | gap |
+----------+----------+------+------------+------------+------+
| 1 | 1 | item | 2018-01-01 | 30 | NULL |
| 1 | 2 | item | 2018-02-01 | 30 | 1 |
| 1 | 3 | item | 2018-03-01 | 30 | 2 |
| 1 | 4 | item | 2018-05-01 | 30 | 31 |
| 1 | 5 | item | 2018-06-01 | 30 | 1 |
+----------+----------+------+------------+------------+------+
You can insert the results into a temp table and pull out only the ones you want (gap > 5), or use the query above as a CTE and pull out the results without the temp table.
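For example, as a CTE over the #fills sample (the first fill per person/item has no predecessor, so its gap is NULL and drops out of the filter):
WITH gaps AS (
SELECT *,
ABS(DATEDIFF(DAY,
LAG(DATEADD(DAY, dayssupply, filldate)) OVER (PARTITION BY personid, item ORDER BY filldate),
filldate)) AS gap
FROM #fills
)
SELECT *
FROM gaps
WHERE gap > 5;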
This could be stated as follows: "Given a set of orders, return a subset for which there is no order within +/- 5 days of the expected resupply date (defined as txDate + DaysSupply)."
This can be solved simply with NOT EXISTS. Define the range of orders you wish to examine, and this query will find the subset of those orders for which there is no resupply order (NOT EXISTS) within 5 days of either side of the expected resupply date (txDate + daysSupply).
SELECT
gappedOrder.personID
, gappedOrder.recordID
, gappedOrder.item
, gappedOrder.txDate
, gappedOrder.daysSupply
FROM
recordTable as gappedOrder
WHERE
gappedOrder.item = 'desiredItem'
AND gappedOrder.txDate > '12/31/16'
AND gappedOrder.txDate < '1/1/18'
--order not refilled within date range tolerance
AND NOT EXISTS
(
SELECT
1
FROM
recordTable AS refilledOrder
WHERE
refilledOrder.personID = gappedOrder.personID
AND refilledOrder.item = gappedOrder.item
--exclude the order itself (relevant only if daysSupply <= 5)
AND refilledOrder.recordID <> gappedOrder.recordID
--5 days prior to (txDate + daysSupply)
AND refilledOrder.txDate >= DATEADD(day, -5, DATEADD(day, gappedOrder.daysSupply, gappedOrder.txDate))
--5 days after (txDate + daysSupply)
AND refilledOrder.txDate <= DATEADD(day, 5, DATEADD(day, gappedOrder.daysSupply, gappedOrder.txDate))
);
