How to convert multiple column values into rows in hive?


Input:
ID COLUMN1 COLUMN2 COLUMN3
1 M,S,E,T 1,2,3,4 5,6,7
2 A,B,C 6,5,8,7,9,1 2,4,3,0,1
Output:
ID COLUMN1 COLUMN2 COLUMN3
1 M 1 5
1 S 2 6
1 E 3 7
1 T 4 NULL
2 A 6 2
2 B 5 4
2 C 8 3
2 NULL 7 0
2 NULL 9 1
2 NULL 1 NULL
Code:
select ID,
array_index( COLUMN1_arr, n ) as COLUMN1,
array_index( COLUMN2_arr, n ) as COLUMN2
from sample
lateral view numeric_range(size(COLUMN1_arr)) n1 as n;
Error:
FAILED: Semantic Exception [Error 10011]: Invalid function array_index
Each column holds multiple comma-separated values in a single field; I need to convert them to rows as shown in the output above.

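explode is a UDTF provided by Hive; combined with split and lateral view, it turns delimited column values into rows. Note that chaining several lateral views yields every combination of the exploded values for each ID, so the rows are not aligned by position.
SELECT ID, col1, col2, col3
FROM tableName
lateral view explode(split(COLUMN1,',')) cols1 AS col1
lateral view explode(split(COLUMN2,',')) cols2 AS col2
lateral view explode(split(COLUMN3,',')) cols3 AS col3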

Plain vanilla hive solution, without brickhouse UDFs.
Demo:
with
input as ( ---------------Input dataset
select stack(2,
1, array('M','S','E','T'), array(1,2,3,4), array(5,6,7),
2, array('A','B','C'), array(6,5,8,7,9,1), array( 2,4,3,0,1)
) as (ID,COLUMN1,COLUMN2,COLUMN3)
),
--explode each array and FULL join them
c1 as (
select i.id, v.column1, p
from input i
lateral view posexplode(i.COLUMN1) v as p,column1
),
c2 as (
select i.id, v.column2, p
from input i
lateral view posexplode(i.COLUMN2) v as p,column2
),
c3 as (
select i.id, v.column3, p
from input i
lateral view posexplode(i.COLUMN3) v as p,column3
)
--FULL JOIN
select coalesce(c1.id,c2.id,c3.id) id, c1.column1, c2.column2, c3.column3
from c1
full join c2 on c1.id=c2.id and c1.p=c2.p
full join c3 on nvl(c1.id,c2.id)=c3.id and nvl(c1.p,c2.p)=c3.p --note NVL usage
;
Result:
OK
id column1 column2 column3
1 M 1 5
1 S 2 6
1 E 3 7
1 T 4 NULL
2 A 6 2
2 B 5 4
2 C 8 3
2 NULL 7 0
2 NULL 9 1
2 NULL 1 NULL
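
The position-wise alignment that the FULL JOIN on (id, p) produces can also be illustrated outside Hive. The following is a minimal Python sketch, not Hive code: the helper name align_columns is hypothetical, and None stands in for NULL when one array is shorter than the others.

```python
from itertools import zip_longest

def align_columns(row_id, *columns):
    """Pair up array elements by position, padding shorter arrays with None."""
    return [(row_id, *values) for values in zip_longest(*columns, fillvalue=None)]

# First input row from the demo above.
rows = align_columns(1, ["M", "S", "E", "T"], [1, 2, 3, 4], [5, 6, 7])
for row in rows:
    print(row)
```

The last tuple carries None in the third column, which corresponds to the NULL in the "1 T 4 NULL" result row.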

Related

T-SQL select rows where [col] = MIN([col])

I have a data set produced from a UNION query that aggregates data from 2 sources.
I want to select that data based on whether data was found in only one of those sources or in both.
The relevant part of the data set looks like this (there are a number of other columns):
row preference group position
1 1 111 1
2 1 111 2
3 1 111 3
4 1 135 1
5 1 135 2
6 1 135 3
7 2 111 1
8 2 135 1
I'm trying to filter on the [preference] column combined with the [group] column: I want to return all rows whose [preference] equals the MIN([preference]) of their [group].
The desired output given the data above would be rows 1 through 6.
The [preference] column indicates the original source of the data in the UNION query, so a legitimate data set could also look like this:
row preference group position
1 1 111 1
2 1 111 2
3 1 111 3
4 2 111 1
5 2 135 1
In which case the desired output would be rows 1, 2, 3 and 5.
What I can't work out is how to do this (not real code):
SELECT * WHERE [preference] = MIN([preference]) PARTITION BY [group]

One way to do this is using RANK:
SELECT row
, preference
, [group]
, position
FROM (
SELECT row
, preference
, [group]
, position
, RANK() OVER (PARTITION BY [group] ORDER BY preference) AS seq
FROM t) t2
WHERE seq = 1

This should be doable via a simple inner join:
SELECT t1.*
FROM t AS t1
INNER JOIN (SELECT [group], MIN(preference) AS preference
FROM t
GROUP BY [group]
) t2 ON t1.[group] = t2.[group]
AND t1.preference = t2.preference
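
As a rough check of the join approach, here is a minimal sketch that runs the same query shape against the second sample data set. It makes several assumptions: Python's built-in sqlite3 module stands in for SQL Server, and the columns are renamed (row_id, grp, pos) to sidestep reserved words, so it is an illustration rather than T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER, preference INTEGER, grp INTEGER, pos INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 1, 111, 1), (2, 1, 111, 2), (3, 1, 111, 3), (4, 2, 111, 1), (5, 2, 135, 1)],
)

# Join each row to the minimum preference of its group and keep the matches.
kept_rows = [
    r[0]
    for r in conn.execute(
        """
        SELECT t1.row_id
        FROM t AS t1
        INNER JOIN (SELECT grp, MIN(preference) AS preference
                    FROM t
                    GROUP BY grp) AS t2
                ON t1.grp = t2.grp
               AND t1.preference = t2.preference
        ORDER BY t1.row_id
        """
    )
]
print(kept_rows)
```

Group 135 only has preference 2, so row 5 is kept, while row 4 is dropped because group 111 also has rows with preference 1.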

Filling missing rows , RIGHT JOIN with each group of GROUP BY

One table holds the Id and Name of the 10 tests which should be done.
A second table holds the product SN, the TestDate, and the Id of each test that has been done on that product.
I need to find and show the tests which should have been done but were not.
A solution with CROSS JOIN and LEFT OUTER JOIN works for 1,000 rows, but for 8,000-15,000 rows it takes a long time (1-3 minutes).
The data is prepared by a CTE query; see the example below.
I want to add the "missing" rows to each group of table2.
table1 => the four tests which should be done
number - Id of the test
data3 - name of the test
table2 => the tests which were done
data1 - Id of the tested device
GROUP => the tests of one device
DECLARE @table1 TABLE (data3 NVARCHAR(20), number INT)
DECLARE @table2 TABLE (data1 NVARCHAR(20), data2 NVARCHAR(20), number INT)
INSERT INTO @table1
SELECT 'xx', 1
UNION ALL
SELECT 'ee', 2
UNION ALL
SELECT 'zz', 3
UNION ALL
SELECT 'gg', 4
INSERT INTO @table2
SELECT '1', 'aaaaaaaaaa', 1 --GROUP 1
UNION ALL
SELECT '1', 'aaaaaaaaaa', 2 --GROUP 1
UNION ALL
SELECT '1', 'aaaaaaaaaa', 3 --GROUP 1
UNION ALL
SELECT '2', 'bbbbbbbbbb', 1 --GROUP 2
UNION ALL
SELECT '2', 'bbbbbbbbbb', 2 --GROUP 2
UNION ALL
SELECT '3', 'cccccccccc', 1 --GROUP 3
UNION ALL
SELECT '3', 'cccccccccc', 3 --GROUP 3
With this query only one row was added (the first one), but I need to fill each group of table2,
e.g. if my group is GROUP BY data1, data2:
SELECT *
FROM @table2 t2
RIGHT JOIN @table1 t1 ON t2.number = t1.number
ORDER BY t2.data1, t1.number
Output:
data1 data2 number data3 number
-----------------------------------------
NULL NULL NULL gg 4
1 aaaaaaaaaa 1 xx 1
1 aaaaaaaaaa 2 ee 2
1 aaaaaaaaaa 3 zz 3
2 bbbbbbbbbb 1 xx 1
2 bbbbbbbbbb 2 ee 2
3 cccccccccc 3 zz 3
3 cccccccccc 1 xx 1
This is my required output (although only one 'number' column would also work)
data1 data2 number number3
-----------------------------------------
1 aaaaaaaaaa 1 1 --GROUP 1
1 aaaaaaaaaa 2 2 --GROUP 1
1 aaaaaaaaaa 3 3 --GROUP 1
NULL NULL NULL 4 --GROUP 1
2 bbbbbbbbbb 1 1 --GROUP 2
2 bbbbbbbbbb 2 2 --GROUP 2
NULL NULL NULL 3 --GROUP 2
NULL NULL NULL 4 --GROUP 2
3 cccccccccc 1 1 --GROUP 3
NULL NULL NULL 2 --GROUP 3
3 cccccccccc 3 3 --GROUP 3
NULL NULL NULL 4 --GROUP 3

Why is data1 NULL for the missing values? I guess it should be filled from table2. Try this query:
;with cte as (
select
distinct a.data1, b.number, b.data3
from
@table2 a
cross join @table1 b
)
select
c.data1, t.data2, t.number, c.data3, number3 = c.number
from
cte c
left join @table2 t on c.data1 = t.data1 and c.number = t.number
Output
data1 data2 number data3 number3
---------------------------------------------
1 aaaaaaaaaa 1 xx 1
1 aaaaaaaaaa 2 ee 2
1 aaaaaaaaaa 3 zz 3
1 NULL NULL gg 4
2 bbbbbbbbbb 1 xx 1
2 bbbbbbbbbb 2 ee 2
2 NULL NULL zz 3
2 NULL NULL gg 4
3 cccccccccc 1 xx 1
3 NULL NULL ee 2
3 cccccccccc 3 zz 3
3 NULL NULL gg 4
If you really need to show NULL values in the data1 column, add a CASE expression that checks the value of data2.
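
The cross join followed by a left join can be pictured as "every device paired with every required test, then look up whether that test was done". Below is a minimal Python model of that idea using the sample rows from the question; the variable names are illustrative only, and None plays the role of NULL.

```python
# (data3, number) rows of table1 and (data1, data2, number) rows of table2.
tests = [("xx", 1), ("ee", 2), ("zz", 3), ("gg", 4)]
done = [
    ("1", "aaaaaaaaaa", 1), ("1", "aaaaaaaaaa", 2), ("1", "aaaaaaaaaa", 3),
    ("2", "bbbbbbbbbb", 1), ("2", "bbbbbbbbbb", 2),
    ("3", "cccccccccc", 1), ("3", "cccccccccc", 3),
]

# Index the completed tests by (device, test number) for the "left join" lookup.
done_by_key = {(data1, number): data2 for data1, data2, number in done}

# "Cross join": every distinct device paired with every required test.
devices = sorted({data1 for data1, _, _ in done})
report = []
for device in devices:
    for data3, number in tests:
        data2 = done_by_key.get((device, number))  # None when the test is missing
        report.append((device, data2, number if data2 is not None else None, data3, number))

# The rows with no match are the tests that still have to be done.
missing = [(device, number3) for device, data2, _, _, number3 in report if data2 is None]
print(missing)
```

The report has one row per device and test (12 rows here), and the rows where data2 is None correspond to the NULL rows in the output above.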

How to sum a column in SQL Server recursive cte for optimization?

I have the following table with hierarchical data:
FolderId ParentFolderId NumberOfAffectedItems
---------------------------------------------
1 NULL 2
2 1 3
3 2 5
4 2 3
5 1 0
I want to find the number of affected items under each folder and all of its children. I can write a recursive CTE that produces the following result; after that, a GROUP BY gives me what I want.
Normal recursive CTE:
WITH FolderTree AS
(
SELECT
fsa.FolderId AS ParentFolderId,
fsa.FolderId AS ChildFolderId,
fsa.NumberOfReportsAffected
FROM
FoldersWithNumberOfReportsAffected fsa
UNION ALL
SELECT
ft.ParentFolderId,
fsa.FolderId AS ChildFolderId,
fsa.NumberOfReportsAffected
FROM
FoldersWithNumberOfReportsAffected fsa
INNER JOIN
FolderTree ft ON fsa.ParentFolderId = ft.ChildFolderId
)
Result:
ParentFolderId ChildFolderId NumberOfAffectedItems
--------------------------------------------------
1 1 2
1 2 3
1 3 5
1 4 3
1 5 0
2 2 3
2 3 5
2 4 3
3 3 5
4 4 3
5 5 0
But I want to optimize it: I want to start from the leaf children and compute NumberOfAffectedItems while moving through the CTE itself.
Expected CTE:
WITH FolderTree AS
(
SELECT
fsa.FolderId AS LeafChildId,
fsa.FolderId AS ParentFolderId,
fsa.NumberOfReportsAffected
FROM
FoldersWithNumberOfReportsAffected fsa
LEFT JOIN
FoldersWithNumberOfReportsAffected f ON fsa.folderid = f.ParentfolderId
WHERE
f.ParentfolderId is null -- this is finding leaf child
UNION ALL
SELECT
ft.LeafChildId,
fsa.FolderId AS ParentFolderId,
fsa.NumberOfReportsAffected + ft.NumberOfReportsAffected AS [ComputedResult]
FROM
FoldersWithNumberOfReportsAffected fsa
INNER JOIN
FolderTree ft ON fsa.FolderId = ft.ParentFolderId
)
Result:
LeafChildId ParentFolderId ComputedNumberOfAffectedItems
---------------------------------------------------------
3 3 5
3 2 8
3 1 10
4 4 3
4 2 5
4 1 7
5 5 0
5 1 2
If I group by ParentFolderId, I get a wrong result: while computing in the CTE, the same parent folder is visited from multiple children, so its count is added more than once. Is there any way to compute the result while going through the CTE itself?

Please check the following solution. I used your CTE as a basis and added the calculation (as column x) to it:
DECLARE @t TABLE(
FolderID INT
,ParentFolderID INT
,NumberOfAffectedItems INT
);
INSERT INTO @t VALUES (1 ,NULL ,2)
,(2 ,1 ,3)
,(3 ,2 ,5)
,(4 ,2 ,3)
,(5 ,1 ,0);
WITH FolderTree AS
(
SELECT 1 AS lvl,
fsa.FolderId AS LeafChildId,
fsa.ParentFolderId AS ParentFolderId,
fsa.NumberOfAffectedItems
FROM
@t fsa
LEFT JOIN
@t f ON fsa.folderid = f.ParentfolderId
WHERE
f.ParentfolderId is null -- this is finding leaf child
UNION ALL
SELECT lvl + 1,
ft.LeafChildId,
fsa.ParentFolderId,
fsa.NumberOfAffectedItems
FROM
FolderTree ft
INNER JOIN @t fsa
ON fsa.FolderId = ft.ParentFolderId
)
SELECT LeafChildId,
ISNULL(ParentFolderId, LeafChildId) ParentFolderId,
NumberOfAffectedItems,
SUM(NumberOfAffectedItems) OVER (PARTITION BY LeafChildId ORDER BY ISNULL(ParentFolderId, LeafChildId) DESC) AS x
FROM FolderTree
ORDER BY 1, 2 DESC
OPTION (MAXRECURSION 0)
Result:
LeafChildId ParentFolderId NumberOfAffectedItems x
3 3 2 2
3 2 5 7
3 1 3 10
4 4 2 2
4 2 3 5
4 1 3 8
5 5 2 2
5 1 0 2
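
For comparison, the per-folder totals the question asks for (each folder's own count plus the counts of all its descendants) can be computed in one bottom-up pass, adding each child's total to its parent exactly once. The following is a minimal Python sketch of that idea, not a T-SQL solution; the function name subtree_totals is hypothetical.

```python
from collections import defaultdict

# (FolderId, ParentFolderId, NumberOfAffectedItems) rows from the question.
folders = [(1, None, 2), (2, 1, 3), (3, 2, 5), (4, 2, 3), (5, 1, 0)]

def subtree_totals(rows):
    """Return {folder_id: own items + items of all descendants}."""
    children = defaultdict(list)
    own = {}
    for folder_id, parent_id, items in rows:
        own[folder_id] = items
        children[parent_id].append(folder_id)

    totals = {}

    def visit(folder_id):
        # Each child subtree is summed once and added to its parent once.
        total = own[folder_id] + sum(visit(child) for child in children[folder_id])
        totals[folder_id] = total
        return total

    for root in children[None]:
        visit(root)
    return totals

totals = subtree_totals(folders)
print(totals)
```

Folder 2 gets 3 + 5 + 3 = 11 here, whereas summing the leaf-to-root paths per parent would count folder 2's own 3 items once for leaf 3 and again for leaf 4.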

Performance issue with CTE SQL Server query

We have a table with a parent child relationship, that represents a deep tree structure.
We are using a view with a CTE to query the data but the performance is poor (see code and execution plan below).
Is there any way we can improve the performance?
WITH cte (ParentJobTypeId, Id) AS
(
SELECT
Id, Id
FROM
dbo.JobTypes
UNION ALL
SELECT
e.Id, cte.Id
FROM
cte
INNER JOIN
dbo.JobTypes AS e ON e.ParentJobTypeId = cte.ParentJobTypeId
)
SELECT
ISNULL(Id, 0) AS ParentJobTypeId,
ISNULL(ParentJobTypeId, 0) AS Id
FROM
cte

A quick example of using range keys. As I mentioned before, my hierarchies had 127K points, and some sections were 15 levels deep.
The CTE builds the hierarchy; let's assume the results will be stored in a table (indexed as well).
Declare @Table table(ID int,ParentID int,[Status] varchar(50))
Insert @Table values
(1,101,'Pending'),
(2,101,'Complete'),
(3,101,'Complete'),
(4,102,'Complete'),
(101,null,null),
(102,null,null)
;With cteOH (ID,ParentID,Lvl,Seq)
as (
Select ID,ParentID,Lvl=1,cast(Format(ID,'000000') + '/' as varchar(500)) from @Table where ParentID is null
Union All
Select h.ID,h.ParentID,cteOH.Lvl+1,Seq=cast(cteOH.Seq + Format(h.ID,'000000') + '/' as varchar(500)) From @Table h INNER JOIN cteOH ON h.ParentID = cteOH.ID
),
cteR1 as (Select ID,Seq,R1=Row_Number() over (Order by Seq) From cteOH),
cteR2 as (Select A.ID,R2 = max(B.R1) From cteOH A Join cteR1 B on (B.Seq Like A.Seq+'%') Group By A.ID)
Select B.R1
,C.R2
,A.Lvl
,A.ID
,A.ParentID
Into #TempHier
From cteOH A
Join cteR1 B on (A.ID=B.ID)
Join cteR2 C on (A.ID=C.ID)
Select * from #TempHier
Select H.R1
,H.R2
,H.Lvl
,H.ID
,H.ParentID
,Total = count(*)
,Complete = sum(case when D.Status = 'Complete' then 1 else 0 end)
,Pending = sum(case when D.Status = 'Pending' then 1 else 0 end)
,PctCmpl = format(sum(case when D.Status = 'Complete' then 1.0 else 0.0 end)/count(*),'##0.00%')
From #TempHier H
Join (Select _R1=B.R1,A.* From @Table A Join #TempHier B on A.ID=B.ID) D on D._R1 between H.R1 and H.R2
Group By H.R1
,H.R2
,H.Lvl
,H.ID
,H.ParentID
Order By 1
This returns the hierarchy in a #Temp table for now. Notice R1 and R2; I call these the range keys. Data can be selected and aggregated via these keys without recursion.
R1 R2 Lvl ID ParentID
1 4 1 101 NULL
2 2 2 1 101
3 3 2 2 101
4 4 2 3 101
5 6 1 102 NULL
6 6 2 4 102
VERY SIMPLE EXAMPLE: illustrates rolling the data up the hierarchy.
R1 R2 Lvl ID ParentID Total Complete Pending PctCmpl
1 4 1 101 NULL 4 2 1 50.00%
2 2 2 1 101 1 0 1 0.00%
3 3 2 2 101 1 1 0 100.00%
4 4 2 3 101 1 1 0 100.00%
5 6 1 102 NULL 2 1 0 50.00%
6 6 2 4 102 1 1 0 100.00%
The real beauty of the range keys is that if you know an ID, you know where it sits in the hierarchy (all of its descendants and ancestors).
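
To make the range-key idea concrete outside SQL, here is a minimal Python sketch that numbers the sample hierarchy in depth-first order: R1 is a node's own sequence number and R2 is the highest sequence number in its subtree. The function name range_keys is hypothetical, and siblings are visited in ID order on the assumption that this matches the zero-padded Seq ordering used in the query.

```python
from collections import defaultdict

# (ID, ParentID) pairs from the sample table above.
rows = [(1, 101), (2, 101), (3, 101), (4, 102), (101, None), (102, None)]

def range_keys(pairs):
    """Return {ID: (R1, R2)} from a depth-first numbering of the hierarchy."""
    children = defaultdict(list)
    for node_id, parent_id in pairs:
        children[parent_id].append(node_id)

    keys = {}
    counter = 0

    def visit(node_id):
        nonlocal counter
        counter += 1
        r1 = counter
        for child in sorted(children[node_id]):
            visit(child)
        # After the whole subtree is numbered, counter holds its last number.
        keys[node_id] = (r1, counter)

    for root in sorted(children[None]):
        visit(root)
    return keys

keys = range_keys(rows)

# Descendants of 101 are exactly the nodes whose R1 falls inside its range.
r1_101, r2_101 = keys[101]
descendants_of_101 = sorted(n for n, (r1, _) in keys.items() if r1_101 < r1 <= r2_101)
print(keys, descendants_of_101)
```

Once the keys are stored, finding descendants is a plain range comparison on R1, which is why no recursion is needed at query time.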

How to get the values in comma separated using joins?

I am working on SQL, I have two tables
EId Ename
1 john
2 alex
3 piers
4 sara
And the second table is
PID PNAME EID
1 mcndd 1
2 carter 1
3 leare 2
4 jain 2
The result should be
EID count PID
1 2 1
1 2 2
2 2 3
2 2 4
I want a query for this. I tried the following:
SELECT t1.EID, COUNT(t1.EID) count,PID
from Employertable t1
INNER JOIN persontable P ON P.EID=t1.EID
Group By t1.EID Having Count(T1.EID) > 1

You can do this using window functions. With those functions you can combine aggregated data with non-aggregated data:
DECLARE @t1 TABLE ( EID INT )
DECLARE @t2 TABLE ( PID INT, EID INT )
INSERT INTO @t1
VALUES ( 1 ),
( 2 ),
( 3 ),
( 4 )
INSERT INTO @t2
VALUES ( 1, 1 ),
( 2, 1 ),
( 3, 2 ),
( 4, 2 )
SELECT *
FROM ( SELECT t1.EID ,
COUNT(*) OVER ( PARTITION BY t2.EID ) AS C ,
t2.PID
FROM @t1 t1
JOIN @t2 t2 ON t2.EID = t1.EID
) t
WHERE t.C > 1
Output:
EID C PID
1 2 1
1 2 2
2 2 3
2 2 4
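
What the windowed COUNT does, attaching the per-group count to every detail row instead of collapsing the group, can be mimicked in a few lines. This is a minimal Python sketch over the same sample data, with illustrative variable names; it is not a replacement for the query.

```python
from collections import Counter

employers = [1, 2, 3, 4]                   # EID values
persons = [(1, 1), (2, 1), (3, 2), (4, 2)]  # (PID, EID) pairs

# Count persons per employer once, like COUNT(*) OVER (PARTITION BY EID).
per_employer = Counter(eid for _, eid in persons)

# Keep every detail row, decorated with its group count, where the count exceeds 1.
result = [
    (eid, per_employer[eid], pid)
    for pid, eid in persons
    if eid in employers and per_employer[eid] > 1
]
print(result)
```

Each person row survives with its employer's count alongside it, which a plain GROUP BY with HAVING cannot return because it yields one row per group.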
