Compare 2 arrays in Scala Spark Dataframe

Compare 2 arrays in Scala Spark Dataframe - arrays

I have a dataframe with 2 columns of Array[String] like this :
+-------------------+--------------------+--------------------+
| HEURE_USAGE| LISTE_CODE_1| LISTE_CODE_2|
+-------------------+--------------------+--------------------+
|2019-09-06 11:34:57|[GBF401, GO0421, ...|[GB9P01, GO2621, ...|
|2019-09-02 13:27:49|[GO1180, BTMF01, ...|[GO3180, OLMP01, ...|
|2019-09-02 13:17:53|[GO1180, BTMF01, ...|[GO1180, BTMF01, ...|
|2019-09-06 11:27:05|[GBF401, GO0421, ...|[GBX401, GO0721, ...|
+-------------------+--------------------+--------------------+
I'm trying to create a column 'LISTE_CODE_3' that would be the intersection of the column 'LISTE_CODE_1' and the column 'LISTE_CODE_2' for each row.
There is a perfect function that does this in Spark 2.4.
It is the intersect function that returns the intersection without duplication.
Unfortunately, this feature does not exist in Spark 2.2.
I think maybe we should compare sets.
Do you have an idea?

You can either use a user-defined function:
spark.udf.register("intersect_arrays", (a: Seq[String], b: Seq[String]) => a intersect b)
spark.sql("select *, intersect_arrays(LISTE_CODE_1, LISTE_CODE_2) as LISTE_CODE_3 from ds")
Or do it in pure Spark SQL (assuming here that HEURE_USAGE is unique across the dataset):
spark.sql("""
select ds.HEURE_USAGE, LISTE_CODE_1, LISTE_CODE_2, coalesce(inter, array()) as LISTE_CODE_3
from ds left join (
select HEURE_USAGE, collect_list(CODE_1) as inter from (
select * from (
select HEURE_USAGE, CODE_1, explode(LISTE_CODE_2) as CODE_2
from (select HEURE_USAGE, explode(LISTE_CODE_1) as CODE_1, LISTE_CODE_2 as LISTE_CODE_2 from ds)
) where CODE_1 = CODE_2
) group by HEURE_USAGE) t
on t.HEURE_USAGE = ds.HEURE_USAGE""")
The idea is to explode LISTE_CODE_1 and LISTE_CODE_2, to keep only the rows that have a matching CODE_1 and CODE_2, to collect the CODEs into a new array, and to join with the original dataframe to keep all the original rows (even those where the intersection is empty).

Related

Insert array of JSONB objects from one table as multiple rows in second table

We are trying to migrate data from an array column containing JSONB to a proper Postgres table.
{{"a":1,"b": 2, "c":"bar"},{"a": 2, "b": 3, "c":"baz"}}
a | b | c
---+---------+---
1 | 2 | "bar"
2 | 3 | "baz"
As part of the process, we have made several attempts using functions like unnest and array_to_json. In the unnest case, we get several JSONB rows, but cannot figure out how to insert them into the second table. In the array_to_json case, we are able to cast the array to a JSON string, but the json_to_recordset does not seem to accept the JSON string from a common table expression.
What would be a good strategy to 'mirror' the array of JSONB items as a proper table, so that we can run the query inside of a stored procedure, triggered on insert?

Use unnest() in a lateral join:
with my_data(json_column) as (
values (
array['{"a":1,"b":2,"c":"bar"}','{"a":2,"b":3,"c":"baz"}']::jsonb[])
)
select
value->>'a' as a,
value->>'b' as b,
value->>'c' as c
from my_data
cross join unnest(json_column) as value
a | b | c
---+---+-----
1 | 2 | bar
2 | 3 | baz
(2 rows)
You may need some casts or converts, e.g.:
select
(value->>'a')::int as a,
(value->>'b')::int as b,
(value->>'c')::text as c
from my_data
cross join unnest(json_column) as value
Lateral joining means that the function unnest() will be executed for each row from the main table. The function returns elements of the array as value.

Postgres: Need to select keywords as separate array values

Datatype:
id: int4
keywords: text
objectivable_id: int4
Postgres version: PostgreSQL 9.5.3
Business_objectives table:
id keywords objectivable_id
1 keyword1a,keyword1b,keyword1c 6
2 keyword2a 6
3 testing 5
Currently the query I'm using is :
select array(select b.keywords from business_objectives b where b.objectivable_id = 6)
It selects the keywords of matched objectivable_id as:
{"keyword1a,keyword1b,keyword1c","keyword2a"}
Over here I wanted the result to be :
{"keyword1a","keyword1b","keyword1c","keyword2a"}
I tried using "string_agg(text, delimiter)", but it just combines all the keywords into one single pocket of an array.

You can simply (and cheaply!) use:
SELECT string_to_array(string_agg(keywords, ','), ',')
FROM business_objectives
WHERE objectivable_id = 6;
Concatenate your comma separate lists with string_agg(), and then convert the complete text to an array with string_to_array().

So something like this can give you expected result:
SELECT array_agg( j.keys )
FROM business_objectives b,
LATERAL ( SELECT k
FROM unnest ( string_to_array( b.keywords, ',' ) ) u( k )
) j( keys )
WHERE b.objectivable_id = 6;
array_agg
-------------------------------------------
{keyword1a,keyword1b,keyword1c,keyword2a}
(1 row)
With the LATERAL part, we look at the outer query to create a new view. Simply it does split of your keywords as set of rows which you can then feed into array_agg() function.
See more about LATERAL: https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-LATERAL

PSQL: Count number of wildcard values in JSONB array

My table has a jsonb column that stores JSON arrays of strings in this format:
["ItemA", "ItemB", "ItemC"]
I'm trying to filter the rows based on the number of certain items in the array, using a wildcard for a part of the name of the item.
From what I have read here on SO, I could use the jsonb_to_recordset function and then just query the data normally, but I can't put the pieces together.
How do I use the jsonb_to_recordset to accomplish this? It's asking for a column definition list, but how do I specify one for just a string array?
My hypothetical (but of course not valid) query would look something like this:
SELECT * FROM mytable, jsonb_to_recordset(mytable.jsonbdata) AS text[] WHERE mytable.jsonbdata LIKE 'Item%'
EDIT:
Maybe it could be done using something like this instead:
SELECT * FROM mytable WHERE jsonbdata ? 'Item%';

Use jsonb_array_elements():
select *
from
mytable t,
jsonb_array_elements_text(jsonbdata) arr(elem)
where elem like 'Item%';
jsonbdata | elem
-----------------------------+-------
["ItemA", "ItemB", "ItemC"] | ItemA
["ItemA", "ItemB", "ItemC"] | ItemB
["ItemA", "ItemB", "ItemC"] | ItemC
(3 rows)
Probably you'll want to select only distinct table rows:
select distinct t.*
from
mytable t,
jsonb_array_elements_text(jsonbdata) arr(elem)
where elem like 'Item%';

Hive table Array Columns - explode using array_index

Hi i have a Hive table
select a,b,c,d from riskfactor_table
In the above table B, C and D columns are array columns. Below is my Hive DDL
Create external table riskfactor_table
(a string,
b array<string>,
c array<double>,
d array<double> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
stored as textfile location 'user/riskfactor/data';
Here is my table data:
ID400S,["jms","jndi","jaxb","jaxn"],[100,200,300,400],[1,2,3,4]
ID200N,["one","two","three"],[212,352,418],[6,10,8]
If i want to split array columns how can i split?
If i use explode function i can split array values for only one column
select explode(b) as b from riskfactor_table;
Output:
jms
jndi
jaxb
jxn
one
two
three
But i want all the columns to be populated using one select statement below-
Query - select a,b,c,d from risk_factor;
Output:
row1- ID400S jms 100 1
row2- ID400S jndi 200 2
row3- ID400S jaxb 300 3
row4- ID400S jaxn 400 4
How can i populate all the data?

You can achieve this using LATERAL VIEW
SELECT Mycoulmna, Mycoulmnb ,Mycoulmnc
FROM riskfactor_table
LATERAL VIEW explode(a) myTablea AS Mycoulmna
LATERAL VIEW explode(a) myTableb AS Mycoulmnb
LATERAL VIEW explode(a) myTablec AS Mycoulmnc ;
for more detail go throw it .

Use the 'numeric_range' UDF from Brickhouse. Here is a blog posting describing the details.
https://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
In your case, your query would be something like
SELECT a,
array_index( b, i ),
array_index( c, i ),
array_index( d, i )
FROM risk_factor_table
LATERAL VIEW numeric_range( 0, 3 );

I was also looking for same question's solution. Thanks Jerome, for this Brickhouse solution.
I had to make a slight change (addition of alias "n1 as n") as below to make it work for my case:
hive> describe test;
OK
id string
animals array<string>
cnt array<bigint>
hive> select * from test;
OK
abc ["cat","dog","elephant","dolphin","snake","parrot","ant","frog","kuala","cricket"] [10597,2027,1891,1868,1804,1511,1496,1432,1305,1299]
hive> select `id`, array_index(`animals`,n), array_index(`cnt`,n) from test lateral view numeric_range(0,10) n1 as n;
OK
abc cat 10597
abc dog 2027
abc elephant 1891
abc dolphin 1868
abc snake 1804
abc parrot 1511
abc ant 1496
abc frog 1432
abc kuala 1305
abc cricket 1299
The only thing is I have to know beforehand that there are 10 elements to be exploded.

Find valid combinations based on matrix

I have a in CALC the following matrix: the first row (1) contains employee numbers, the first column (A) contains productcodes.
Everywhere there is an X that productitem was sold by the corresponding employee above
| 0302 | 0303 | 0304 | 0402 |
1625 | X | | X | X |
1643 | | X | X | |
...
We see that product 1643 was sold by employees 0303 and 0304
What I would like to see is a list of what product was sold by which employees but formatted like this:
1625 | 0302, 0304, 0402 |
1643 | 0303, 0304 |
The reason for this is that we need this matrix ultimately imported into an SQL SERVER table. We have no access to the origins of this matrix. It contains about 50 employees and 9000+ products.
Thanx for thinking with us!

try something like this
;with data as
(
SELECT *
FROM ( VALUES (1625,'X',NULL,'X','X'),
(1643,NULL,'X','X',NULL))
cs (col1, [0302], [0303], [0304], [0402])
),cte
AS (SELECT col1,
col
FROM data
CROSS apply (VALUES ('0302',[0302]),
('0303',[0303]),
('0304',[0304]),
('0402',[0402])) cs (col, val)
WHERE val IS NOT NULL)
SELECT col1,
LEFT(cs.col, Len(cs.col) - 1) AS col
FROM cte a
CROSS APPLY (SELECT col + ','
FROM cte B
WHERE a.col1 = b.col1
FOR XML PATH('')) cs (col)
GROUP BY col1,
LEFT(cs.col, Len(cs.col) - 1)

I think there are two problems to solve:
get the product codes for the X marks;
concatenate them into a single, comma-separated string.
I can't offer a solution for both issues in one step, but you may handle both issues separately.
1.
To replace the X marks by the respective product codes, you could use an array function to create a second table (matrix). To do so, create a new sheet, copy the first column / first row, and enter the following formula in cell B2:
=IF($B2:$E3="X";$B$1:$E$1;"")
You'll have to adapt the formula, so it covers your complete input data (If your last data cell is Z9999, it would be =IF($B2:$Z9999="X";$B$1:$Z$1;"")). My example just covers two rows and four columns.
After modifying it, confirm with CTRL+SHIFT+ENTER to apply it as array formula.
2.
Now, you'll have to concatenate the product codes. LO Calc lacks a feature to concatenate an array, but you could use a simple user-defined function. For such a string-join function, see this answer. Just create a new macro with the StarBasic code provided there and save it. Now, you have a STRJOIN() function at hand that accepts an array and concatenates its values, leaving empty values out.
You could add that function using a helper column on the second sheet and apply it by dragging it down. Finally, to get rid of the cells with the single product IDs, copy the complete second sheet, paste special into a third sheet, pasting only the values. Now, you can remove all columns except the first one (employee IDs) and the last one (with the concatenated product ids).

I created a table in sql for holding the data:
CREATE TABLE [dbo].[mydata](
[prod_code] [nvarchar](8) NULL,
[0100] [nvarchar](10) NULL,
[0101] [nvarchar](10) NULL,
[and so on...]
I created the list of columns in Calc by copying and pasting them transposed. After that I used the concatenate function to create the columnlist + datatype for the create table statement
I cleaned up the worksheet and imported it into this table using SQL Server's import wizard. Cleaning meant removing unnecessary rows/columns. Since the columnnames were identical mapping was done correctly for 99%.
Now I had the data in SQL Server.
I adapted the code MM93 suggested a bit:
;with data as
(
SELECT *
FROM dbo.mydata <-- here i simply referenced the whole table
),cte
and in the next part I uses the same 'worksheet' trick to list and format all the column names and pasted them in.
),cte
AS (SELECT prod_code, <-- had to replace col1 with 'prod_code'
col
FROM data
CROSS apply (VALUES ('0100',[0100]),
('0101', [0101] ),
(and so on... ),
The result of this query was inserted into a new table and my colleagues and I are querying our harts out :)
PS: removing the 'FOR XML' clause resulted in a table with two columns :
prodcode | employee
which containes al the unique combinations of prodcode + employeenumber which is a lot faster and much more practical to query.