Hive table Array Columns - explode using array_index - arrays

Hi, I have a Hive table:
select a,b,c,d from riskfactor_table
In the above table, the b, c and d columns are array columns. Below is my Hive DDL:
Create external table riskfactor_table
(a string,
b array<string>,
c array<double>,
d array<double> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
stored as textfile location 'user/riskfactor/data';
Here is my table data:
ID400S,["jms","jndi","jaxb","jaxn"],[100,200,300,400],[1,2,3,4]
ID200N,["one","two","three"],[212,352,418],[6,10,8]
How can I split the array columns?
If I use the explode function, I can split the array values of only one column:
select explode(b) as b from riskfactor_table;
Output:
jms
jndi
jaxb
jaxn
one
two
three
But I want all the columns to be populated using one select statement, as below:
Query - select a,b,c,d from riskfactor_table;
Output:
row1- ID400S jms 100 1
row2- ID400S jndi 200 2
row3- ID400S jaxb 300 3
row4- ID400S jaxn 400 4
How can I populate all the data?

You can achieve this using LATERAL VIEW:
SELECT a, myColumnB, myColumnC, myColumnD
FROM riskfactor_table
LATERAL VIEW explode(b) myTableB AS myColumnB
LATERAL VIEW explode(c) myTableC AS myColumnC
LATERAL VIEW explode(d) myTableD AS myColumnD;
For more details, go through it.
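One caveat worth checking before relying on stacked explodes: each LATERAL VIEW explodes its array independently, so the result contains every combination of elements rather than position-aligned rows. The following is a plain-Python sketch of that behaviour (it is not Hive code, and the function name is made up for illustration), using the first sample row from the question.

```python
from itertools import product

# First sample row from the question: an id plus three parallel arrays.
row = ("ID400S", ["jms", "jndi", "jaxb", "jaxn"], [100, 200, 300, 400], [1, 2, 3, 4])

def stacked_explode(row):
    """Mimic three independent explodes: every combination of b, c and d."""
    a, b, c, d = row
    return [(a, x, y, z) for x, y, z in product(b, c, d)]

rows = stacked_explode(row)
print(len(rows))  # 4 * 4 * 4 combinations, not the 4 rows the question asks for
print(rows[0])
```

If only the four aligned rows are wanted, the arrays have to be walked by a shared position instead, which is what the index-based answers in this thread do.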

Use the 'numeric_range' UDF from Brickhouse. Here is a blog posting describing the details.
https://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
In your case, your query would be something like
SELECT a,
array_index( b, i ),
array_index( c, i ),
array_index( d, i )
FROM riskfactor_table
LATERAL VIEW numeric_range( 0, 3 );

I was looking for a solution to the same question. Thanks, Jerome, for this Brickhouse solution.
I had to make a slight change (adding the alias "n1 as n"), as below, to make it work for my case:
hive> describe test;
OK
id string
animals array<string>
cnt array<bigint>
hive> select * from test;
OK
abc ["cat","dog","elephant","dolphin","snake","parrot","ant","frog","kuala","cricket"] [10597,2027,1891,1868,1804,1511,1496,1432,1305,1299]
hive> select `id`, array_index(`animals`,n), array_index(`cnt`,n) from test lateral view numeric_range(0,10) n1 as n;
OK
abc cat 10597
abc dog 2027
abc elephant 1891
abc dolphin 1868
abc snake 1804
abc parrot 1511
abc ant 1496
abc frog 1432
abc kuala 1305
abc cricket 1299
The only thing is I have to know beforehand that there are 10 elements to be exploded.
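The index-based idea can be sketched in plain Python to show how the range could be derived from the data instead of being hardcoded. This is only an illustration of the logic, not Hive or Brickhouse code; the function name is hypothetical, and padding shorter arrays with None is an assumption of this sketch.

```python
def explode_aligned(row_id, *arrays):
    """Yield one output row per array position, like array_index(col, n) over a range."""
    # Assumption: the range comes from the longest array; shorter ones are padded with None.
    length = max((len(arr) for arr in arrays), default=0)
    for n in range(length):
        yield (row_id, *(arr[n] if n < len(arr) else None for arr in arrays))

# Second sample row from the original question.
rows = list(explode_aligned("ID200N", ["one", "two", "three"], [212, 352, 418], [6, 10, 8]))
for r in rows:
    print(r)
```

In Hive itself the same effect would need the upper bound to be computed per row (for example from the array size) rather than written as a literal.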


Comparing values between records in a table using Informatica PowerCenter

Consider a table with the following records in a Database:
>>> Table A:
Col_1 Col_2 Col_3
GGG 123 -
GGG 123 X
GGG 123 Y
KKK 786 X
MMM 999 Y
DDD 456 X
DDD 456 U
Wherever we have records with matching values in col_1 and col_2, and we have values X and Y in col_3, the records with X and Y must be deleted. In other cases, we should keep the records.
For example in the above table, the output should look like this:
>>> Output_Table:
Col_1 Col_2 Col_3
GGG 123 -
KKK 786 X
MMM 999 Y
DDD 456 X
DDD 456 U
How can this scenario be implemented (using expression transformation, variable ports, lookup and so on...)? Any help would be greatly appreciated.
There can be multiple scenarios, and I am not sure your issue is exactly as you described, but I will answer the question as asked.
Assuming Col_3 can contain 'X' and 'Y', the hardcoded values you want to remove:
1. First, sort the data on Col_1, Col_2.
2. Then use an EXP transformation and create 7 ports like below. Here we compare each row with its previous row to see whether they have the same key. If they do, we concatenate col3 into one single column; otherwise we start over with the current value.
col1
col2
in_col3
v_col3= iif(v_prev_col1=col1 and v_prev_col2=col2,v_col3||in_col3,in_col3)
v_prev_col1=col1
v_prev_col2=col2
o_col3=v_col3
3. After that, use an aggregator whose group-by ports are col1, col2. col3 will then be MAX(o_col3) from the expression before. The AGG stamps the concatenated col3 into one single column.
4. Then add a filter like below to check whether you have XY or YX for duplicate rows.
iif(max_col3='XY' or reverse(max_col3)='XY',FALSE,TRUE) -- You can place any hardcoded values here.
EDIT:
5. Now, if you want to get the original data (as in the comments) excluding the XY combination, use a joiner. Join the output of step 4 with the output of step 1; it will be a normal join on Col_1, Col_2.
The output of the joiner will have no XY combination.
The whole mapping should look like this:
           |->2.EXP-->3.AGG-->4.FIL--|
-->1.SRT ->|------------------------>|->5.JNR--...--> TGT
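To make the rule from the question concrete, here is a small plain-Python sketch of the expected behaviour, which may be useful for checking a mapping's output. It is not Informatica code; the function name and the in-memory row format are assumptions made for illustration.

```python
from collections import defaultdict

# Sample records from the question: (Col_1, Col_2, Col_3).
rows = [
    ("GGG", 123, "-"), ("GGG", 123, "X"), ("GGG", 123, "Y"),
    ("KKK", 786, "X"), ("MMM", 999, "Y"),
    ("DDD", 456, "X"), ("DDD", 456, "U"),
]

def drop_xy_pairs(rows, pair=("X", "Y")):
    """Drop the X and Y rows of every (Col_1, Col_2) key that has both values."""
    # Collect the set of Col_3 values seen for each key.
    seen = defaultdict(set)
    for col1, col2, col3 in rows:
        seen[(col1, col2)].add(col3)
    # Keys for which both values of the pair occur.
    flagged = {key for key, vals in seen.items() if set(pair) <= vals}
    return [r for r in rows if not ((r[0], r[1]) in flagged and r[2] in pair)]

result = drop_xy_pairs(rows)
for r in result:
    print(r)
```

Note that this keeps the GGG row with '-' while removing its X and Y rows, as in the expected output table of the question.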

Compare 2 arrays in Scala Spark Dataframe

I have a dataframe with 2 columns of Array[String] like this :
+-------------------+--------------------+--------------------+
| HEURE_USAGE| LISTE_CODE_1| LISTE_CODE_2|
+-------------------+--------------------+--------------------+
|2019-09-06 11:34:57|[GBF401, GO0421, ...|[GB9P01, GO2621, ...|
|2019-09-02 13:27:49|[GO1180, BTMF01, ...|[GO3180, OLMP01, ...|
|2019-09-02 13:17:53|[GO1180, BTMF01, ...|[GO1180, BTMF01, ...|
|2019-09-06 11:27:05|[GBF401, GO0421, ...|[GBX401, GO0721, ...|
+-------------------+--------------------+--------------------+
I'm trying to create a column 'LISTE_CODE_3' that would be the intersection of the column 'LISTE_CODE_1' and the column 'LISTE_CODE_2' for each row.
Spark 2.4 has a function that does exactly this:
array_intersect, which returns the intersection without duplicates.
Unfortunately, this function does not exist in Spark 2.2.
I think maybe we should compare sets.
Do you have an idea?
You can either use a user-defined function:
spark.udf.register("intersect_arrays", (a: Seq[String], b: Seq[String]) => a intersect b)
spark.sql("select *, intersect_arrays(LISTE_CODE_1, LISTE_CODE_2) as LISTE_CODE_3 from ds")
Or do it in pure Spark SQL (assuming here that HEURE_USAGE is unique across the dataset):
spark.sql("""
  select ds.HEURE_USAGE, LISTE_CODE_1, LISTE_CODE_2, coalesce(inter, array()) as LISTE_CODE_3
  from ds left join (
    select HEURE_USAGE, collect_list(CODE_1) as inter from (
      select * from (
        select HEURE_USAGE, CODE_1, explode(LISTE_CODE_2) as CODE_2
        from (select HEURE_USAGE, explode(LISTE_CODE_1) as CODE_1, LISTE_CODE_2 as LISTE_CODE_2 from ds)
      ) where CODE_1 = CODE_2
    ) group by HEURE_USAGE) t
  on t.HEURE_USAGE = ds.HEURE_USAGE""")
The idea is to explode LISTE_CODE_1 and LISTE_CODE_2, to keep only the rows that have a matching CODE_1 and CODE_2, to collect the CODEs into a new array, and to join with the original dataframe to keep all the original rows (even those where the intersection is empty).
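The per-row logic can be sketched in plain Python as follows. This is not Spark code; the function name is hypothetical, the sample lists are shortened from the values visible in the question, and dropping duplicates is a choice made here to match the behaviour the question describes for Spark 2.4.

```python
def intersect_arrays(codes_1, codes_2):
    """Return the elements of codes_1 that also occur in codes_2, without duplicates."""
    lookup = set(codes_2)
    seen = set()
    result = []
    for code in codes_1:
        # Keep the first occurrence of every code present in both lists.
        if code in lookup and code not in seen:
            seen.add(code)
            result.append(code)
    return result

# Shortened, illustrative rows based on the values shown in the question.
print(intersect_arrays(["GO1180", "BTMF01"], ["GO1180", "BTMF01"]))
print(intersect_arrays(["GO1180", "BTMF01"], ["GO3180", "OLMP01"]))
```

The second call returns an empty list, which corresponds to the coalesce(inter, array()) case in the SQL version.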

SQL Finding Records Containing Same Character 1 or More Times

I did not find anything in my search. Also, if anyone has a better suggestion for a title, feel free to edit my post.
What I am trying to do is find records that look like this:
Xxxxxxx
Aaaaaaaa
aaaaaaaa
bBbbbbbbb
I do NOT want to return records that look like this
abcdef
123 abc
123 aaaaaa
Is there any way to do this?
Edit #1:
Basically, I want to find records where the column contains only 1 character, regardless of the case, repeated multiple times.
If you want strings that are all the same character, one method uses replace():
where len(replace(upper(col), upper(left(col, 1)), '')) = 0
upper() is not needed for case-insensitive collations.
You can also use replicate():
where upper(col) = replicate(left(upper(col), 1), len(col))
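The replicate() comparison can be illustrated outside the database with a short Python sketch. It is not T-SQL; the function name is made up, and it assumes plain ASCII text and treats an empty string as a non-match.

```python
def is_single_repeated_char(value):
    """True when the string consists of one character repeated, ignoring case."""
    lowered = value.lower()
    # Rebuild the string from its first character and compare, like replicate().
    return len(lowered) > 0 and lowered == lowered[0] * len(lowered)

samples = ["Xxxxxxx", "Aaaaaaaa", "aaaaaaaa", "bBbbbbbbb", "abcdef", "123 abc", "123 aaaaaa"]
matches = [s for s in samples if is_single_repeated_char(s)]
print(matches)
```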
One idea, using NGrams8k, is to check that the maximum and minimum characters are the same:
SELECT V.String
FROM (VALUES(1,'Xxxxxxx'),
(2,'Aaaaaaaa'),
(3,'aaaaaaaa'),
(4,'bBbbbbbbb'),
(5,'abcdef'),
(6,'123 abc'),
(7,'123 aaaaaa'))V(ID,String)
CROSS APPLY dbo.NGrams8k(V.String,1) NG
GROUP BY V.ID,
V.String
HAVING MAX(UPPER(NG.token)) = MIN(UPPER(NG.token));
Another way...
-- Sample Data
DECLARE @t TABLE (string VARCHAR(100));
INSERT @t VALUES ('a'),('b'),
('Xxxxxxx'),('Aaaaaaaa'),('aaaaaaaa'),('bBbbbbbbb'),('abcdef'),('123 abc'),('123 aaaaaa');
-- Solution
SELECT t.string
FROM @t AS t
WHERE LEN(t.string) > 1
AND PATINDEX('%[^'+LEFT(t.string,1)+']%', SUBSTRING(t.string,2,8000)) = 0;
Results:
string
-------------------
Xxxxxxx
Aaaaaaaa
aaaaaaaa
bBbbbbbbb

Find valid combinations based on matrix

In Calc, I have the following matrix: the first row (1) contains employee numbers, and the first column (A) contains product codes.
Wherever there is an X, that product was sold by the corresponding employee above.
     | 0302 | 0303 | 0304 | 0402 |
1625 |  X   |      |  X   |  X   |
1643 |      |  X   |  X   |      |
...
We see that product 1643 was sold by employees 0303 and 0304
What I would like to see is a list of what product was sold by which employees but formatted like this:
1625 | 0302, 0304, 0402 |
1643 | 0303, 0304 |
The reason for this is that we need this matrix ultimately imported into an SQL SERVER table. We have no access to the origins of this matrix. It contains about 50 employees and 9000+ products.
Thanx for thinking with us!
Try something like this:
;with data as
(
SELECT *
FROM ( VALUES (1625,'X',NULL,'X','X'),
(1643,NULL,'X','X',NULL))
cs (col1, [0302], [0303], [0304], [0402])
),cte
AS (SELECT col1,
col
FROM data
CROSS apply (VALUES ('0302',[0302]),
('0303',[0303]),
('0304',[0304]),
('0402',[0402])) cs (col, val)
WHERE val IS NOT NULL)
SELECT col1,
LEFT(cs.col, Len(cs.col) - 1) AS col
FROM cte a
CROSS APPLY (SELECT col + ','
FROM cte B
WHERE a.col1 = b.col1
FOR XML PATH('')) cs (col)
GROUP BY col1,
LEFT(cs.col, Len(cs.col) - 1)
I think there are two problems to solve:
get the employee numbers (the header row values) for the X marks;
concatenate them into a single, comma-separated string.
I can't offer a solution for both issues in one step, but you may handle both issues separately.
1. To replace the X marks with the respective employee numbers from the header row, you could use an array formula to create a second table (matrix). To do so, create a new sheet, copy the first column / first row, and enter the following formula in cell B2:
=IF($B2:$E3="X";$B$1:$E$1;"")
You'll have to adapt the formula, so it covers your complete input data (If your last data cell is Z9999, it would be =IF($B2:$Z9999="X";$B$1:$Z$1;"")). My example just covers two rows and four columns.
After modifying it, confirm with CTRL+SHIFT+ENTER to apply it as array formula.
2. Now you'll have to concatenate the employee numbers. LO Calc lacks a feature to concatenate an array, but you could use a simple user-defined function. For such a string-join function, see this answer. Just create a new macro with the StarBasic code provided there and save it. Now you have a STRJOIN() function at hand that accepts an array and concatenates its values, leaving empty values out.
You could add that function using a helper column on the second sheet and apply it by dragging it down. Finally, to get rid of the cells with the single employee numbers, copy the complete second sheet and paste special into a third sheet, pasting only the values. Now you can remove all columns except the first one (product codes) and the last one (with the concatenated employee numbers).
I created a table in sql for holding the data:
CREATE TABLE [dbo].[mydata](
[prod_code] [nvarchar](8) NULL,
[0100] [nvarchar](10) NULL,
[0101] [nvarchar](10) NULL,
[and so on...]
I created the list of columns in Calc by copying them and pasting them transposed. After that, I used the concatenate function to create the column list + data type for the create table statement.
I cleaned up the worksheet and imported it into this table using SQL Server's import wizard. Cleaning meant removing unnecessary rows/columns. Since the column names were identical, the mapping was done correctly for 99%.
Now I had the data in SQL Server.
I adapted the code MM93 suggested a bit:
;with data as
(
SELECT *
FROM dbo.mydata -- here I simply referenced the whole table
),cte
In the next part I used the same 'worksheet' trick to list and format all the column names, and pasted them in.
),cte
AS (SELECT prod_code, -- had to replace col1 with 'prod_code'
col
FROM data
CROSS apply (VALUES ('0100',[0100]),
('0101', [0101] ),
(and so on... ),
The result of this query was inserted into a new table, and my colleagues and I are querying our hearts out :)
PS: removing the 'FOR XML' clause resulted in a table with two columns:
prodcode | employee
which contains all the unique combinations of prodcode + employee number, which is a lot faster and much more practical to query.
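For readers who want to see the two shapes side by side, here is a plain-Python sketch of the unpivot step (one row per product/employee pair) and the optional comma-joined form. It is only an illustration of the transformation, not the SQL Server or Calc solution; the in-memory layout and function names are assumptions.

```python
# The two sample rows from the question: header row of employee numbers,
# then one list of marks per product code.
employees = ["0302", "0303", "0304", "0402"]
matrix = {
    "1625": ["X", "", "X", "X"],
    "1643": ["", "X", "X", ""],
}

def unpivot(employees, matrix):
    """Return (product, employee) pairs for every cell marked X."""
    return [
        (product, employee)
        for product, marks in matrix.items()
        for employee, mark in zip(employees, marks)
        if mark == "X"
    ]

def join_per_product(pairs):
    """Collapse the pairs into one comma-separated employee list per product."""
    grouped = {}
    for product, employee in pairs:
        grouped.setdefault(product, []).append(employee)
    return {product: ", ".join(emps) for product, emps in grouped.items()}

pairs = unpivot(employees, matrix)
joined = join_per_product(pairs)
print(pairs)
print(joined)
```

The list of pairs corresponds to the two-column prodcode | employee table, and the joined dictionary to the comma-separated format originally requested.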

Sum of values of json array in PostgreSQL

In PostgreSQL 9.3, I have a table like this
id | array_json
---+----------------------------
1 | ["{123: 456}", "{789: 987}", "{111: 222}"]
2 | ["{4322: 54662}", "{123: 5121}", "{1: 5345}" ... ]
3 | ["{3232: 413}", "{5235: 22}", "{2: 5453}" ... ]
4 | ["{22: 44}", "{12: 4324}", "{234: 4235}" ... ]
...
I want to get the sum of all values in array_json column. So, for example, for first row, I want:
id | total
---+-------
1 | 1665
Where 1665 = 456 + 987 + 222 (the values of all the elements of json array). No previous information about the keys of the json elements (just random numbers)
I'm reading the documentation page about JSON functions in PostgreSQL 9.3, and I think I should use json_each, but can't find the right query. Could you please help me with it?
Many thanks in advance
You started looking in the right place (the docs are always the right place to start).
Since your values are JSON arrays, I would suggest using json_array_elements(json).
And since it's a JSON array that you have to explode into several rows and then combine back by summing over json_each_text(json), it would be best to create your own function (Postgres allows it).
As for your specific case, assuming the structure you provided is correct, some string parsing plus heavy JSON wizardry can be used. Let's say your table name is "json_test_table" and the columns are "id" and "json_array"; here is a query that does it:
select id, sum(val) from
(select id,
substring(
json_each_text(
replace(
replace(
replace(
replace(
replace(json_array,':','":"')
,'{',''),
'}','')
,']','}')
,'[','{')::json)::varchar
from '\"(.*)\"')::int as val
from json_test_table) j group by id ;
If you plan to run it on a huge dataset, keep in mind that string manipulations are expensive in terms of performance.
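The string-parsing idea can be checked outside the database with a short Python sketch. It assumes the structure shown in the question, namely a valid JSON array whose elements are strings of the form "{key: value}" with integer keys and values; the function name is hypothetical and this is not PostgreSQL code.

```python
import json
import re

def sum_values(array_json_text):
    """Sum the number after the colon in each '{key: value}' element."""
    elements = json.loads(array_json_text)  # the outer array is valid JSON
    total = 0
    for element in elements:
        # Each element is a string such as "{123: 456}", which is not valid JSON itself.
        match = re.fullmatch(r"\{\s*\d+\s*:\s*(-?\d+)\s*\}", element)
        if match is None:
            raise ValueError(f"unexpected element: {element!r}")
        total += int(match.group(1))
    return total

first_row = '["{123: 456}", "{789: 987}", "{111: 222}"]'
print(sum_values(first_row))  # 456 + 987 + 222
```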
You can get it using this:
/*
Sorry, sqlfiddle is busy :p
CREATE TABLE my_table
(
id bigserial NOT NULL,
array_json json[]
--,CONSTRAINT my_table_pkey PRIMARY KEY (id)
)
INSERT INTO my_table(array_json)
values (array['{"123": 456}'::json, '{"789": 987}'::json, '{"111": 222}'::json]);
*/
select id, sum(json_value::integer)
from
(
select id, json_data->>json_object_keys(json_data) as json_value from
(
select id, unnest(array_json) as json_data from my_table
) A
) B
group by id
