How can I do natural sort in Snowflake?

I am trying to sort Jira stories with several potential structures: 3 to 5 characters followed by dash followed by 3 or 4 digits e.g. XXX-###, XXX-####, XXXX-### etc.

I'm not quite sure what you mean by a natural sort, but you can sort by a regex. For example, you can extract just the number from a JiraID string:
--This Returns '123'
select REGEXP_SUBSTR( 'xx-123' , '\\d+');
--This Returns 'xx'
select REGEXP_SUBSTR( 'xx-123' , '\\w+');
--Sort by the numeric part of a Jira ID column, compared as an integer
... order by REGEXP_SUBSTR( JiraID , '\\d+')::int;
--Sort first by the text prefix, then by the numeric part within each prefix
... order by REGEXP_SUBSTR( JiraID , '\\w+'), REGEXP_SUBSTR( JiraID , '\\d+')::int;
More complex examples are possible if you provide sample input with the desired output ordering.
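For readers who want to check the ordering outside Snowflake, here is a minimal Python sketch of the same two-part sort key: the text prefix first, then the number compared as an integer. The function name `jira_sort_key`, the letters-only prefix pattern, and the sample IDs are assumptions made for illustration. It mirrors the ORDER BY above and is not Snowflake code.

```python
import re

def jira_sort_key(jira_id):
    """Split 'ABC-123' into ('ABC', 123) so the number compares as an integer."""
    match = re.fullmatch(r"([A-Za-z]+)-(\d+)", jira_id)
    if match is None:
        raise ValueError(f"unexpected Jira ID format: {jira_id!r}")
    prefix, number = match.groups()
    return (prefix, int(number))

# Hypothetical sample IDs; a plain string sort would put 'ABC-1000' before 'ABC-999'.
ids = ["ABCD-101", "ABC-1000", "ABC-999", "ABCD-20"]
print(sorted(ids, key=jira_sort_key))
```

Casting the numeric part to an integer is what makes 999 sort before 1000. Without the cast, the comparison is character by character.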

Related

Filter repeating two digits pattern in SQL

I would like to filter out numeric strings in which a two-digit pattern repeats more than 3 times in a row. For example, a value containing "121212" should be kept, but a value containing "12121212" should be filtered out. I have been trying to solve this with a regular expression but could not find a way.
In Oracle, use a regular expression to match two digits and then use a back-reference to check that the same pair is repeated 3 or more further times (four or more occurrences in a row). Then, to exclude those rows, add NOT to the filter:
SELECT *
FROM table_name
WHERE NOT REGEXP_LIKE(value, '(\d\d)\1{3,}')
Which, for the sample data:
CREATE TABLE table_name (value) AS
SELECT '121212' FROM DUAL UNION ALL
SELECT '12121212' FROM DUAL UNION ALL
SELECT '123121212' FROM DUAL;
Outputs:
VALUE
121212
123121212
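The back-reference behaviour can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for Oracle's REGEXP_LIKE, on the assumption that the two treat this particular pattern the same way. The variable names are illustrative.

```python
import re

# Same pattern as the Oracle example: a two-digit pair followed by
# three or more repeats of that exact pair (four or more occurrences).
REPEATED_PAIR = re.compile(r"(\d\d)\1{3,}")

values = ["121212", "12121212", "123121212"]

# Keep only the values that do NOT contain such a run, like the NOT REGEXP_LIKE filter.
kept = [v for v in values if REPEATED_PAIR.search(v) is None]
print(kept)
```

The quantifier applies to the back-reference only, so `{3,}` counts repeats after the first captured pair.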

Counting phonetically similar words in a column in SQL Server table

I have a SQL Server table with phrases in it. I want to group phrases that sound similar (phonetically), comparing the entire text, and show a count for each group. I was able to do that with SOUNDEX using the query below. It works perfectly if the values are all single words, but if I introduce phrases it only matches on the first word in the phrase.
Example:
Beatles, Baetles, Beetles - count 3
The Beatles, The Rolling Stones - count 2 (when I am expecting it to be 1 each).
Here is my T-SQL
select top 3
count(*) as count,
dbo.getWord(soundex(word)) as word,
soundex(word)
from
words
group by
soundex(word)
order by
count desc;
getWord function:
CREATE FUNCTION [dbo].[getWord]
(@code nvarchar(4))
RETURNS nvarchar(50)
AS
BEGIN
DECLARE @return_value nvarchar(50);
SELECT TOP 1 @return_value = word
FROM words
WHERE soundex(word) = @code
RETURN @return_value
END;
This solution does not work for phrases. I also tried implementing a function based on the Damerau-Levenshtein algorithm in T-SQL, which takes two words and a maximum distance, but I don't see how to use it for my needs.
Any suggestion or solution to a similar problem would be appreciated.
Though not explicitly documented, SOUNDEX is intended to be used against single words, not phrases. For values like 'Beatles' and 'Beetles' it therefore works as you expect; for 'The Rolling Stones' and 'The Beatles' it doesn't. Instead, SOUNDEX returns the "sound" of the first word in the string, which for both is 'The', so the SOUNDEX for both is 'T000'.
What you could do is remove the spaces from the strings, to treat it as one word:
SELECT SOUNDEX(REPLACE(V.String,' ','')) AS Sound,
V.String,
COUNT(*) OVER (PARTITION BY SOUNDEX(REPLACE(V.String,' ',''))) AS [Count]
FROM (VALUES('The Rolling Stones'),
('The Beatles'),
('Beetles'),
('Baetles'),
('Beatles')) V(String);
This returns the following dataset:
Sound String             Count
----- ------------------ -----------
B342  Baetles            3
B342  Beetles            3
B342  Beatles            3
T134  The Beatles        1
T645  The Rolling Stones 1
An alternative approach would be to get the sound of each word, aggregate the returned sounds in word order, and group on the result. This needs a splitter that returns each word's ordinal position. STRING_SPLIT has an ordinal parameter for this in Azure SQL Database and in SQL Server 2022 and later; on earlier versions of SQL Server you will need a user-defined function or another method that returns the ordinal position.
I use STRING_SPLIT with the ordinal parameter here to demonstrate the idea but (again) if it is not available on your version you'll need a different method (such as DelimitedSplit8K/DelimitedSplit8K_LEAD or a JSON splitter).
SELECT SE.Sound,
V.String,
COUNT(*) OVER (PARTITION BY SE.Sound) AS [Count]
FROM (VALUES('The Rolling Stones'),
('The Beatles'),
('Beetles'),
('Baetles'),
('Beatles')) V(String)
CROSS APPLY (SELECT STRING_AGG(SOUNDEX(SS.[value]),'') WITHIN GROUP (ORDER BY SS.Ordinal)
FROM STRING_SPLIT(V.String,' ',1) SS) SE(Sound);
This returns the following dataset:
Sound        String             Count
------------ ------------------ -----------
B342         Beetles            3
B342         Baetles            3
B342         Beatles            3
T000B342     The Beatles        1
T000R452S352 The Rolling Stones 1

SQL - Inclusive Matches in column or across multiple rows

I'm sure this has been asked before, but I don't know how to search for it.
First off, I do not want to implement full-text searching. The database contains multiple languages including Chinese and Japanese which pose a huge problem for Full-text indexes.
I have a table like the following:
Table Comment:
UserID int
CommentText nvarchar(400)
I want to do a search against this table and find anything matching multiple words. Normally I would just do something like
select *
from Comment
where CommentText like '%potato%' and CommentText like '%badger%'
But if the two words are in different rows I need to do something like
select
UserID, count(UserID )
from
Comment
where
CommentText like '%potato%' or CommentText like '%badger%'
group by
UserID
having
count(UserID ) > 1
But then if the words are sometimes in the same row and sometimes spread across multiple rows, how do I determine if both words matched?
Cases:
Both words are in a single row.
One word is in row 1 and the other word is in row 2 for the same UserID
One word is in multiple rows for the same UserID (so it returns multiple matches even if it's the same word several times)
My question is: for multiple words, how do I conduct a wildcard search and make sure all the words match at least once for a given UserID?
Thanks in advance
I'm thinking of a CTE to grab all rows that contain matches and concat them for a given userid, but I don't know if I can find something more efficient.
A simple way to approach this would use conditional aggregation:
SELECT UserID
FROM Comment
GROUP BY UserID
HAVING
SUM(CASE WHEN CommentText LIKE '%java%' THEN 1 ELSE 0 END) > 0 AND
SUM(CASE WHEN CommentText LIKE '%python%' THEN 1 ELSE 0 END) > 0;
Each sum in the HAVING clause keeps track of one of the words you want to match. A user appears in the result set only when at least one of their rows matches each word.
Note that if you plan to continue down this road, you should look into SQL Server's full text capabilities.
https://learn.microsoft.com/en-us/sql/relational-databases/search/full-text-search
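To see how the conditional aggregation handles the three cases from the question, here is a small sketch that runs the same query against an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for SQL Server here, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Comment (UserID INTEGER, CommentText TEXT)")
conn.executemany(
    "INSERT INTO Comment VALUES (?, ?)",
    [
        (1, "java and python in one row"),  # both words in a single row
        (2, "I like java"),                 # words split across two rows
        (2, "python is fine too"),
        (3, "java"),                        # same word twice, never the other
        (3, "more java"),
    ],
)

rows = conn.execute(
    """
    SELECT UserID
    FROM Comment
    GROUP BY UserID
    HAVING SUM(CASE WHEN CommentText LIKE '%java%' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN CommentText LIKE '%python%' THEN 1 ELSE 0 END) > 0
    ORDER BY UserID
    """
).fetchall()
matching_users = [user_id for (user_id,) in rows]
print(matching_users)
```

User 3 is excluded because repeated matches of one word only raise the first sum, while the second stays at zero.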
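Just posting the question caused me to think of the answer.
-- INTERSECT keeps only the users found by both searches (UNION would also return users matching just one word)
select distinct UserID from (
select UserID FROM Comment where CommentText like '%java%'
INTERSECT
select UserID FROM Comment where CommentText like '%python%'
) as a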

How to sort numbers that are in varchar type under GROUP BY GROUPING SETS in SQL Server?

I have a column called UPC (varchar type) that contains only digits.
I also have a GROUP BY GROUPING SETS on UPC.
Currently it is sorted as varchar, so the numbers are not ordered from smallest to largest.
GROUP BY
GROUPING SETS(([UPC]),())
If I convert UPC from varchar to bigint in the ORDER BY, the numbers are sorted numerically, but then the last row generated by GROUPING SETS moves from last to first.
GROUP BY
GROUPING SETS(([UPC]),())
ORDER BY convert(bigint, UPC)
Is there a way to move my "grand total" to the last row and still have my numbers sorted numerically?
I'm guessing I might need to use GROUPING_ID?
You can customize your ORDER BY clause as in the example below:
Order by case when isnumeric(UPC) = 1 then convert(bigint, UPC) else 9999999999 end
The grand-total row has a NULL UPC, so it falls into the ELSE branch. The number 9999999999 is a placeholder that must be larger than any UPC in your data, so that row sorts last.
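The same "total row last, everything else numeric" ordering can also be expressed without a magic number by sorting on a flag first. The Python sketch below illustrates that idea with made-up rows. In T-SQL the analogous flag would probably be `GROUPING(UPC)` in the ORDER BY, but that is an assumption rather than something tested here.

```python
# Hypothetical result rows: (UPC, total). UPC is None on the grand-total row,
# standing in for the NULL that GROUPING SETS produces for the () grouping.
rows = [(None, 60), ("100", 10), ("9", 20), ("25", 30)]

def upc_sort_key(row):
    upc, _total = row
    # False sorts before True, so real UPCs come first and the None row last;
    # within the real UPCs, compare as integers rather than as text.
    return (upc is None, int(upc) if upc is not None else 0)

ordered = sorted(rows, key=upc_sort_key)
print(ordered)
```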

How to get MAX value of a version-number (varchar) column in T-SQL

I have a table defined like this:
Column:  Version      Message
Type:    varchar(20)  varchar(100)
----------------------------------
Row 1:   2.2.6        Message 1
Row 2:   2.2.7        Message 2
Row 3:   2.2.12       Message 3
Row 4:   2.3.9        Message 4
Row 5:   2.3.15       Message 5
I want to write a T-SQL query that gets the message for the MAX version number, where the "Version" column represents a software version number, i.e. 2.2.12 is greater than 2.2.7, 2.3.15 is greater than 2.3.9, and so on. Unfortunately, I can't think of an easy way to do that without using CHARINDEX or some other complicated split-like logic. Running this query:
SELECT MAX(Version) FROM my_table
will yield the erroneous result:
2.3.9
When it should really be 2.3.15. Any bright ideas that don't get too complex?
One solution would be to use a table-valued split function to split the versions into rows and then combine them back into columns so that you can do something like:
Select TOP 1 Major, Minor, Build
From ( ...derived crosstab query )
Order By Major Desc, Minor Desc, Build Desc
Actually, another way is to use the PARSENAME function which was meant to split object names:
Select TOP 1 Z.Version
From my_table As Z
Order By Cast(Parsename( Z.Version , 3 ) As Int) Desc
, Cast(Parsename( Z.Version , 2 ) As Int) Desc
, Cast(Parsename( Z.Version , 1 ) As Int) Desc
Does it have to be efficient on a large table? If so, I suggest you create an indexed, persisted computed column that transforms the version into a format that sorts correctly, and use that column in your queries. Otherwise you'll always scan the table end to end.
If the table is small, it doesn't matter. Then you can rank at query time, using a split function or (ab)using PARSENAME as Thomas suggested.
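As an illustration of why comparing each part as an integer fixes the ordering, here is a minimal Python sketch using the sample rows from the question. The helper name `version_key` is made up. It shows the comparison logic only, not the T-SQL.

```python
def version_key(version):
    """Turn '2.3.15' into (2, 3, 15) so each part compares as an integer."""
    return tuple(int(part) for part in version.split("."))

# Sample rows from the question: (Version, Message).
rows = [
    ("2.2.6", "Message 1"),
    ("2.2.7", "Message 2"),
    ("2.2.12", "Message 3"),
    ("2.3.9", "Message 4"),
    ("2.3.15", "Message 5"),
]

string_max = max(version for version, _ in rows)          # text comparison
numeric_max = max(rows, key=lambda r: version_key(r[0]))  # per-part integers
print(string_max)
print(numeric_max)
```

The text comparison stops at the first differing character ('9' versus '1'), which is why MAX on the varchar column picks 2.3.9.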
