Dedup rows case-insensitively (Snowflake)

I would like to dedup rows when the same name appears more than once in different letter cases.
Original table:
ID   Name
1    Apple
2    Banana
1    apple
2    APPLE
3    BANANA
Desired output after deduping (prefer lowercase when a name appears in multiple cases):
ID   Name
2    Banana
1    apple
2    apple
3    Banana
The ID 1 "Apple" row was removed because ID 1 "apple" exists.
The ID 2 "APPLE" becomes "apple" because "apple" exists (under ID 1).
The ID 3 "BANANA" becomes "Banana" because the spelling with more lowercase letters takes priority.
The following statement only dedups within each ID, so ID 2 "APPLE" stays "APPLE" and ID 3 "BANANA" stays "BANANA", which is not what I want.
create table DELETE2 as select ID, max(Name) as Name
FROM TEST."PUBLIC"."DELETE1"
group by ID, lower(Name);
drop table DELETE1;
alter table DELETE2 rename to DELETE1;

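Working SQL you can paste into Snowflake and run:
Technique: split each name into an array of characters, convert each character to its ASCII code, and sum the codes. Lowercase letters have higher ASCII codes than uppercase ones, so within each case-insensitive group the spelling with the highest sum is the one with the most lowercase letters.
No updates, no functions, just plain old SQL ;-)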
with cte as (
select 1 ID, 'Apple' name
union select 2 ID, 'Banana' name
union select 1 ID, 'apple' name
union select 2 ID, 'APPLE' name
union select 3 ID, 'BANANA' name ),
lu as (
select
name,
lower (name) lu_name,
sum(ascii(a.value :: string)) ac,
max(ac) over (partition by lower(name)) mac,
iff ( max(ac) over (partition by lower(name)) = sum(ascii(a.value :: string)),name, null) g
from
cte,
lateral flatten(
input => split(regexp_replace(name, '.', ',\\0', 2), ',')
) a
group by 1,2
)
select
cte.id, lu.name
from
cte
left outer join lu on lower(cte.name) = lu.lu_name and lu.g is not null
group by 1, 2
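The rule the query above implements can be checked outside Snowflake. The following is only a sketch in Python, not the answer's SQL: the names `ascii_sum` and `dedup_prefer_lowercase` are made up for illustration, and the tie-break on the string itself is an assumption added here. Because upper and lower case differ by 32 per letter, two spellings with the same number of lowercase letters (for example "aB" and "Ab") have equal sums, so the ASCII-sum rule alone cannot separate them.

```python
from collections import defaultdict

# Sample rows from the question: (ID, Name)
rows = [(1, "Apple"), (2, "Banana"), (1, "apple"), (2, "APPLE"), (3, "BANANA")]


def ascii_sum(name):
    """Sum of character codes; more lowercase letters give a larger sum."""
    return sum(ord(ch) for ch in name)


def dedup_prefer_lowercase(rows):
    # Collect every spelling seen for each case-insensitive name.
    variants = defaultdict(set)
    for _id, name in rows:
        variants[name.lower()].add(name)

    # Pick one spelling per group: highest ASCII sum wins.
    # The second tuple element is a hypothetical tie-break (not in the SQL).
    canonical = {
        key: max(names, key=lambda n: (ascii_sum(n), n))
        for key, names in variants.items()
    }

    # Replace each name with its group's spelling and drop duplicate rows.
    return sorted({(_id, canonical[name.lower()]) for _id, name in rows})


result = dedup_prefer_lowercase(rows)
print(result)
```

For the sample data this yields the four rows from the desired output: (1, apple), (2, Banana), (2, apple), (3, Banana).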

How about:
create table DELETE2 as
select ID, Name
from (
select ID, lower(Name) as Name1, max(Name) as Name
FROM TEST."PUBLIC"."DELETE1"
group by ID, lower(Name)
)
;

Related

Select row number inside a select query SQL Server

I have a Select query which returns data based on highest mark received as follows.
SELECT
Name, Mark,
ROW_NUMBER() OVER (ORDER BY Mark DESC) AS Rank
FROM
table_1
WHERE
IsDelete ='false'
Result:
Rank   Name     Mark
1      User1    10
2      User2    8
3      User11   6
I have another query that returns rows from the same table whose name matches the search text:
SELECT Name, Mark
FROM table_1
WHERE name LIKE '%' + #SearchText + '%'
ORDER BY Mark DESC
Result:
Name     Mark
User1    10
User2    8
User11   6
I need the row number of each candidate, based on marks, for only the rows matching the search text, in a single query.
The result should look like this when I enter 'r1' as the search text:
Rank   Name     Mark
1      User1    10
3      User11   6
Another way to do this is with a CTE (common table expression), which often improves readability (at least for me), e.g.:
WITH HighestMarks (Name, Mark, Rank)
AS (
SELECT Name, Mark,
ROW_NUMBER() OVER (ORDER BY Mark DESC) AS Rank
FROM table_1
WHERE IsDelete ='false'
)
SELECT * FROM HighestMarks WHERE name LIKE '%'+#SearchText+'%' ORDER BY Rank
Use a subquery:
select t.*
from (select Name, Mark, ROW_NUMBER() OVER(ORDER BY Mark DESC) as Rank
from table_1
where IsDelete = 'false'
) t
where name like '%' + #SearchText + '%';
Note: Do not splice values such as #SearchText into the query string. Use parameters!
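Both answers rely on the same idea: number the rows first, then filter, so the surviving rows keep their original rank. A minimal sketch of that idea, run against SQLite from Python rather than SQL Server: the alias `Rnk`, the in-memory table, and the `||` concatenation are SQLite-side assumptions, and window functions need SQLite 3.25 or later. The search text is passed as a bound parameter instead of being spliced into the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (Name TEXT, Mark INTEGER, IsDelete TEXT)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [("User1", 10, "false"), ("User2", 8, "false"), ("User11", 6, "false")],
)

# Rank every non-deleted row inside the CTE, then filter by name outside it,
# so the filter cannot change the row numbers.
query = """
WITH HighestMarks AS (
    SELECT Name, Mark,
           ROW_NUMBER() OVER (ORDER BY Mark DESC) AS Rnk
    FROM table_1
    WHERE IsDelete = 'false'
)
SELECT Rnk, Name, Mark
FROM HighestMarks
WHERE Name LIKE '%' || ? || '%'
ORDER BY Rnk
"""

# The search text is a bound parameter, not part of the SQL string.
ranked = conn.execute(query, ("r1",)).fetchall()
print(ranked)
```

With 'r1' as the search text, User1 keeps rank 1 and User11 keeps rank 3, as in the requested result.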

How to combine two columns into one column like a map in Hive?

In Hive I have a table with the following columns:
user_id    product_id    score
1          1, 2, 3       0.7, 0.2, 0.1
2          2, 3, 1       0.5, 0.25, 0.25
The product_id and score columns are both of type string. I want to generate a new column that combines product_id and score like this:
user_id    product_score
1          1:0.7, 2:0.2, 3:0.1
2          2:0.5, 3:0.25, 1:0.25
In the new table, the product_score column looks like a map, with product_id as the key and score as the value, but it is still a string. Each product_id is joined to its score by ':', and the pairs are joined by ',' in the same order as product_id in the original table. How can I achieve this?
Use split() to get arrays and map() to convert them to a map (this assumes exactly three elements per row):
select user_id,
map(product_id[0], score[0],
product_id[1], score[1],
product_id[2], score[2]
) as product_score
from
(
select user_id, split(product_id,',') as product_id, split(score,',') as score
from ...
)s;
Solved: merge two array columns as key and value, preserving order.
Approach: explode each array with posexplode and keep only the rows where the positions from both columns are equal.
SQL query:
with rowidcol as
(
select user_id, split(product_id, ',') prod_arr, split(score, ',') score_arr, row_number() over() as row_id
from prod
),
coltorows as
(
select row_id, user_id, prod_arr[prd_index] product, score_arr[score_index] score, prd_index, score_index
from rowidcol
LATERAL view posexplode(prod_arr) ptable as prd_index, pdid
LATERAL view posexplode(score_arr) prtable as score_index, sid
),
colselect as
(
select row_id, user_id, collect_list(concat(product, ':', score)) product_score
from coltorows
where prd_index = score_index
group by row_id, user_id
)
select user_id, concat_ws(',', product_score) as prodcut_score
from colselect
order by user_id;
Input -
Table Name - Prod -
user_id product_id score
1 A,B,C,D 10,20,30,40
2 X,Y,Z 1,2,3
3 K,F,G 100,200,300
Output -
user_id prodcut_score
1 A:10,B:20,C:30,D:40
2 X:1,Y:2,Z:3
3 K:100,F:200,G:300
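The position-matching step in the posexplode answer is the same thing as zipping two lists element by element. As a quick way to check expected values, here is a small Python sketch (not Hive; the function name `combine` and the length check are assumptions added for illustration):

```python
def combine(product_ids, scores):
    """Pair the i-th product with the i-th score, keeping the original order."""
    products = [p.strip() for p in product_ids.split(",")]
    values = [s.strip() for s in scores.split(",")]
    if len(products) != len(values):
        # Assumption: mismatched lists are treated as an error, not truncated.
        raise ValueError("product_id and score have different lengths")
    return ", ".join(f"{p}:{v}" for p, v in zip(products, values))


combined = combine("1, 2, 3", "0.7, 0.2, 0.1")
print(combined)
```

Stripping whitespace matters for the question's data, where the separators are ", " rather than a bare comma; a plain split on ',' in Hive would leave the leading spaces in place.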

TSQL : Find PAIR Sequence in a table

I have the following table in T-SQL (there are other columns too, but no identity column or primary key column):
Oid Cid
1 a
1 b
2 f
3 c
4 f
5 a
5 b
6 f
6 g
7 f
In the example above, I would like to flag the following Oids as duplicates, treating each Oid's set of Cid values as a "pair":
Oid:
1 (1 matches Oid: 5)
2 (2 matches Oid: 4 and 7)
Please note that the matches for Oid 2 do not include Oid 6, since Oid 6 also has the letter 'g'.
Is it possible to write a query, without a WHILE loop, that highlights the Oids as above, along with a count of how many other matches exist in the database?
I am trying to find patterns within the dataset relating to these two columns. Thank you in advance.
Here is a worked example - see comments for explanation:
--First set up your data in a table variable
declare @oidcid table (Oid int, Cid char(1));
insert into @oidcid values
(1,'a'),
(1,'b'),
(2,'f'),
(3,'c'),
(4,'f'),
(5,'a'),
(5,'b'),
(6,'f'),
(6,'g'),
(7,'f');
--This cte gets a table with all of the cids in order, for each oid
with cte as (
select distinct Oid, (select Cid + ',' from @oidcid i2
where i2.Oid = i.Oid order by Cid
for xml path('')) Cids
from @oidcid i
)
select Oid, cte.Cids
from cte
inner join (
-- Here we get just the lists of cids that appear more than once
select Cids, Count(Oid) as OidCount
from cte group by Cids
having Count(Oid) > 1 ) as gcte on cte.Cids = gcte.Cids
-- And when we list them, we are showing the oids with duplicate cids next to each other
Order by cte.Cids;
Or, with a self-join on Cid:
-- replace [table] with your table name
select o1.Cid, o1.Oid, o2.Oid
, count(*) over (partition by o1.Cid) + 1 as [cnt]
from [table] o1
join [table] o2
on o1.Cid = o2.Cid
and o1.Oid < o2.Oid
order by o1.Cid, o1.Oid, o2.Oid
Maybe like this then:
WITH CTE AS
(
SELECT Cid, oid
,ROW_NUMBER() OVER (PARTITION BY cid ORDER BY cid) AS RN
,SUM(1) OVER (PARTITION BY oid) AS maxRow2
,SUM(1) OVER (PARTITION BY cid) AS maxRow
FROM oid
)
SELECT * FROM CTE WHERE maxRow != 1 AND maxRow2 = 1
ORDER BY oid
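The first answer's approach (build each Oid's ordered Cid list, then group Oids that share the same list) can be checked with a few lines of Python. This is only a sketch of the logic, not T-SQL; the function name `duplicate_groups` is made up, and using a set per Oid assumes a repeated (Oid, Cid) row should count once.

```python
from collections import defaultdict

# Sample data from the question: (Oid, Cid)
pairs = [
    (1, "a"), (1, "b"), (2, "f"), (3, "c"), (4, "f"),
    (5, "a"), (5, "b"), (6, "f"), (6, "g"), (7, "f"),
]


def duplicate_groups(pairs):
    # Step 1: collect the set of Cids for each Oid.
    cids_by_oid = defaultdict(set)
    for oid, cid in pairs:
        cids_by_oid[oid].add(cid)

    # Step 2: group Oids by their sorted Cid tuple (the "signature").
    oids_by_signature = defaultdict(list)
    for oid in sorted(cids_by_oid):
        signature = tuple(sorted(cids_by_oid[oid]))
        oids_by_signature[signature].append(oid)

    # Step 3: keep only signatures shared by more than one Oid.
    return {sig: oids for sig, oids in oids_by_signature.items() if len(oids) > 1}


groups = duplicate_groups(pairs)
print(groups)
```

Oid 6 is excluded from the 'f' group because its signature is ('f', 'g'), which matches the requirement in the question. The number of other matches for an Oid is the size of its group minus one.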

SQL Server: Joining in rows via a comma-separated field

I'm trying to extract some data from a third party system which uses an SQL Server database. The DB structure looks something like this:
Order
OrderID OrderNumber
1 OX101
2 OX102
OrderItem
OrderItemID OrderID OptionCodes
1 1 12,14,15
2 1 14
3 2 15
Option
OptionID Description
12 Batteries
14 Gift wrap
15 Case
[etc.]
What I want is one row per order item that includes a concatenated field with each option description. So something like this:
OrderItemID OrderNumber Options
1 OX101 Batteries\nGift Wrap\nCase
2 OX101 Gift Wrap
3 OX102 Case
Of course this is complicated by the fact that the options are a comma separated string field instead of a proper lookup table. So I need to split this up by comma in order to join in the options table, and then concat the result back into one field.
At first I tried creating a function which splits out the option data by comma and returns this as a table. Although I was able to join the result of this function with the options table, I wasn't able to pass the OptionCodes column to the function in the join, as it only seemed to work with declared variables or hard-coded values.
Can someone point me in the right direction?
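I would use a splitting function (a table-valued function, called dbo.Split in the code below, that returns one row per value in a column named items) to get the individual values and keep them in a CTE. Then you can join the CTE to your table called "Option".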
SELECT * INTO #Order
FROM (
SELECT 1 OrderID, 'OX101' OrderNumber UNION SELECT 2, 'OX102'
) X;
SELECT * INTO #OrderItem
FROM (
SELECT 1 OrderItemID, 1 OrderID, '12,14,15' OptionCodes
UNION
SELECT 2, 1, '14'
UNION
SELECT 3, 2, '15'
) X;
SELECT * INTO #Option
FROM (
SELECT 12 OptionID, 'Batteries' Description
UNION
SELECT 14, 'Gift Wrap'
UNION
SELECT 15, 'Case'
) X;
WITH N AS (
SELECT I.OrderID, I.OrderItemID, X.items OptionCode
FROM #OrderItem I CROSS APPLY dbo.Split(OptionCodes, ',') X
)
SELECT Q.OrderItemID, Q.OrderNumber,
CONVERT(NVarChar(1000), (
SELECT T.Description + ','
FROM N INNER JOIN #Option T ON N.OptionCode = T.OptionID
WHERE N.OrderItemID = Q.OrderItemID
FOR XML PATH(''))
) Options
FROM (
SELECT N.OrderItemID, O.OrderNumber
FROM #Order O INNER JOIN N ON O.OrderID = N.OrderID
GROUP BY N.OrderItemID, O.OrderNumber) Q
DROP TABLE #Order;
DROP TABLE #OrderItem;
DROP TABLE #Option;
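The three steps in that query (split the codes, look each one up, join the descriptions back together) are easy to verify against the sample data. A minimal Python sketch of the same steps, not SQL Server code: the function name `describe` and the choice of dictionaries for the lookup tables are assumptions for illustration.

```python
# Sample data from the question, as plain Python structures.
orders = {1: "OX101", 2: "OX102"}                      # OrderID -> OrderNumber
order_items = [(1, 1, "12,14,15"), (2, 1, "14"), (3, 2, "15")]
options = {12: "Batteries", 14: "Gift wrap", 15: "Case"}  # OptionID -> Description


def describe(order_items, orders, options, sep="\n"):
    """Return (OrderItemID, OrderNumber, Options) with descriptions joined by sep."""
    result = []
    for item_id, order_id, codes in order_items:
        # Split the comma-separated codes and look each one up, keeping order.
        descriptions = [options[int(code)] for code in codes.split(",") if code.strip()]
        result.append((item_id, orders[order_id], sep.join(descriptions)))
    return result


described = describe(order_items, orders, options)
for row in described:
    print(row)
```

Note that the SQL above appends ',' after every description, so its output ends with a trailing comma; joining with a separator, as here, does not.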

Getting filtered results with subquery

I have a table with something like the following:
ID Name Color
------------
1 Bob Blue
2 John Yellow
1 Bob Green
3 Sara Red
3 Sara Green
What I would like to do is return a filtered list of results whereby the following data is returned:
ID Name Color
------------
1 Bob Blue
2 John Yellow
3 Sara Red
That is, I would like to return one row per user. (I do not mind which row is returned for a particular user; I just need [ID] to be unique.) I already have something that works but is really slow: I create a temp table containing all the IDs and then use an OUTER APPLY to select the top 1 row from the same table, i.e.
CREATE TABLE #tb
(
[ID] [int]
)
INSERT INTO #tb
select distinct [ID] from MyTable
select
T1.[ID],
V2.[Name],
V2.Color
from
#tb T1
OUTER APPLY
(
SELECT TOP 1 * FROM MyTable T2 WHERE T2.[ID] = T1.[ID]
) AS V2
DROP TABLE #tb
Can somebody suggest how I may improve it?
Thanks
Try:
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RowNo,
ID, Name, Color
FROM MyTable
)
SELECT ID, Name, Color
FROM CTE
WHERE RowNo = 1
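or
select
ID, Name, Color
from
(
select
ID, Name, Color,
-- rank rows within each ID; the ordering column is arbitrary here.
-- If (ID, Color) can repeat, ties return several rows; row_number() avoids that.
rank() over (partition by ID order by Color) as RowRank
from
MyTable
)
HRRanks
where
RowRank = 1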
If you're using SQL Server 2005 or higher, you could use the Ranking functions and just grab the first one in the list.
http://msdn.microsoft.com/en-us/library/ms189798.aspx
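To see the "number the rows within each ID, keep the first" pattern run end to end, here is a small sketch using SQLite from Python instead of SQL Server. The in-memory table and the choice of Color as the ordering column are assumptions for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER, Name TEXT, Color TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [
        (1, "Bob", "Blue"),
        (2, "John", "Yellow"),
        (1, "Bob", "Green"),
        (3, "Sara", "Red"),
        (3, "Sara", "Green"),
    ],
)

# Number the rows inside each ID partition, then keep row 1 of each partition.
one_per_id = conn.execute(
    """
    SELECT ID, Name, Color
    FROM (
        SELECT ID, Name, Color,
               ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Color) AS RowNo
        FROM MyTable
    )
    WHERE RowNo = 1
    ORDER BY ID
    """
).fetchall()
print(one_per_id)
```

Ordering by Color picks Sara's Green row rather than the Red one shown in the question; since any row per ID is acceptable there, the ordering column only needs to be deterministic.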
