Efficient OSQL union query of two classes with the same ancestor - query-optimization

I have a complex class hierarchy of this form:
- A (common ancestor)
  - B
  - C
    - D
    - E
    - F
  - G
    - H
    - I
From my application I need this kind of query:
SELECT FROM D
WHERE ...
But recently I would also like to do this kind of query:
SELECT FROM A
WHERE @class IN ['D', 'I'] AND ...
My question is: how efficient is this last query, and what is the best practice for optimizing it?

Trying your query with a sample dataset I created:
SELECT FROM A
WHERE @class IN ['D', 'I']
I see execution times from 40 ms (cold cache) down to 15 ms (hot cache).
Using this one, instead, I see an improvement (3ms with hot cache):
select expand($c)
let $a=(SELECT FROM D),
$b=(SELECT FROM I),
$c=unionAll($a,$b)
EDIT
Here is the query with WHERE and LIMIT conditions:
select from (
select expand($c)
let $a=(SELECT FROM D),
$b=(SELECT FROM I),
$c=unionAll($a,$b)
)
where value = 5 LIMIT 10
Hope it helps.
Ivan

I'm not sure I have fully understood your question. My understanding is that you want to know whether the query is faster with SELECT FROM A WHERE @class IN ['D', 'I'].
I made a test with 5,000 records for class A, 5,000 for class B, 5,000 for class C, and so on.
I compared the query SELECT FROM A WHERE @class IN ['D', 'I'] LIMIT -1
with the query SELECT EXPAND( $c ) LET $a = ( SELECT FROM D ), $b = ( SELECT FROM I ), $c = UNIONALL( $a, $b ) LIMIT -1.
The second query should be faster because it does not select all 45,000 records of class A and then evaluate the WHERE clause; it reads only the 5,000 records of type D and the 5,000 of type I.
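The difference between the two plans can be sketched in plain Python (a hypothetical in-memory stand-in for the classes, not OrientDB itself):

```python
# Hypothetical in-memory stand-in for the hierarchy: each class name
# maps to its list of records, mirroring the sample dataset above.
hierarchy = {
    "D": [{"cls": "D", "value": i} for i in range(5)],
    "E": [{"cls": "E", "value": i} for i in range(5)],
    "I": [{"cls": "I", "value": i} for i in range(5)],
}

# SELECT FROM A WHERE @class IN ['D', 'I']:
# scan every record in the hierarchy, then filter by class.
all_records = [r for recs in hierarchy.values() for r in recs]
filtered = [r for r in all_records if r["cls"] in ("D", "I")]

# unionAll($a, $b): read only the two target classes, no filtering pass.
unioned = hierarchy["D"] + hierarchy["I"]

print(len(all_records), len(filtered), len(unioned))  # 15 10 10
```

Both approaches return the same rows; the union variant just never touches the records of the other classes.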

Related

Determine how similar two arrays are in PostgreSQL

I'm aware you can compare two arrays in PostgreSQL to see if the elements of one are contained in the elements of another, like so:
SELECT ARRAY[1,2] <@ ARRAY[1,2,3] --> true
Is there any way to get the number of matches, or to say "it matches 2 of 3"?
SELECT ARRAY[1,2] ?? ARRAY[1,2,3] --> 2/3 or 66.6666%
I'm open to interesting solutions. I want to take an array and ultimately say it must match 2 of 3 elements from another array in an inline query, or >= 66%, or something of that nature.
Ideally like this..
SELECT * FROM SOMETABLE WHERE ARRAY[1,2] ?? ARRAY[1,2,3] >= 66.66666666666667
Thanks in advance.
Starting from the Array functions documentation:
with match as (
  select count(a) as match_ct
  from unnest(ARRAY[1,2]) as a
  join (select * from unnest(ARRAY[2,1,3]) as b) t on a = t.b
)
select match_ct / total_ct::numeric
from match,
  (select count(*) as total_ct
   from unnest(ARRAY[1,2], ARRAY[2,1,3]) as t(a, b)) as total;
?column?
------------------------
0.66666666666666666667
You could create a function for that:
CREATE FUNCTION array_similarity(anyarray, anyarray)
RETURNS double precision
LANGUAGE sql
IMMUTABLE STRICT AS
$$SELECT 100.0 * count(*) / cardinality($2)
FROM unnest($1) AS a1(e1)
WHERE ARRAY[e1] <@ $2$$;
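The same percentage can be checked in plain Python (a hypothetical helper mirroring the SQL function above):

```python
def array_similarity(a, b):
    # Mirrors the SQL function: count how many elements of a are
    # contained in b, then divide by the length of b.
    matches = sum(1 for e in a if e in b)
    return 100.0 * matches / len(b)

print(array_similarity([1, 2], [1, 2, 3]))  # 66.66666666666667
```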

Compare 2 arrays in Scala Spark Dataframe

I have a dataframe with 2 columns of Array[String] like this :
+-------------------+--------------------+--------------------+
| HEURE_USAGE| LISTE_CODE_1| LISTE_CODE_2|
+-------------------+--------------------+--------------------+
|2019-09-06 11:34:57|[GBF401, GO0421, ...|[GB9P01, GO2621, ...|
|2019-09-02 13:27:49|[GO1180, BTMF01, ...|[GO3180, OLMP01, ...|
|2019-09-02 13:17:53|[GO1180, BTMF01, ...|[GO1180, BTMF01, ...|
|2019-09-06 11:27:05|[GBF401, GO0421, ...|[GBX401, GO0721, ...|
+-------------------+--------------------+--------------------+
I'm trying to create a column 'LISTE_CODE_3' that would be the intersection of the column 'LISTE_CODE_1' and the column 'LISTE_CODE_2' for each row.
Spark 2.4 has a function that does exactly this: array_intersect, which returns the intersection without duplicates.
Unfortunately, it does not exist in Spark 2.2.
I think maybe we should compare sets.
Do you have an idea?
You can either use a user-defined function:
spark.udf.register("intersect_arrays", (a: Seq[String], b: Seq[String]) => a intersect b)
spark.sql("select *, intersect_arrays(LISTE_CODE_1, LISTE_CODE_2) as LISTE_CODE_3 from ds")
Or do it in pure Spark SQL (assuming here that HEURE_USAGE is unique across the dataset):
spark.sql("""
select ds.HEURE_USAGE, LISTE_CODE_1, LISTE_CODE_2, coalesce(inter, array()) as LISTE_CODE_3
from ds left join (
select HEURE_USAGE, collect_list(CODE_1) as inter from (
select * from (
select HEURE_USAGE, CODE_1, explode(LISTE_CODE_2) as CODE_2
from (select HEURE_USAGE, explode(LISTE_CODE_1) as CODE_1, LISTE_CODE_2 as LISTE_CODE_2 from ds)
) where CODE_1 = CODE_2
) group by HEURE_USAGE) t
on t.HEURE_USAGE = ds.HEURE_USAGE""")
The idea is to explode LISTE_CODE_1 and LISTE_CODE_2, to keep only the rows that have a matching CODE_1 and CODE_2, to collect the CODEs into a new array, and to join with the original dataframe to keep all the original rows (even those where the intersection is empty).
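For reference, the intersection semantics the question asks for (order-preserving, no duplicates in the result) can be sketched in plain Python:

```python
def intersect_arrays(a, b):
    # Order-preserving intersection without duplicates: keep each
    # element of a at most once, and only if it also appears in b.
    remaining = set(b)
    out = []
    for x in a:
        if x in remaining:
            out.append(x)
            remaining.discard(x)
    return out

print(intersect_arrays(["GO1180", "BTMF01", "XX"], ["GO1180", "BTMF01"]))
# ['GO1180', 'BTMF01']
```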

How to cast a string to array of struct in HiveQL

I have a hive table with the column "periode", the type of the column is string.
The column have values like the following:
[{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
[{periode:20161212-20161217,nb:0}]
I want to cast this column in array<struct<periode:string, nb:int>>.
The final goal is to have one row per periode.
For this I want to use lateral view with explode on the column periode.
That's why I want to convert it to array<struct<string, int>>
Thanks for help.
Sidi
You don't need to "cast" anything, you just need to explode the array and then unpack the struct. I added an index to your data to make it more clear where things are ending up.
Data:
idx arr_of_structs
0 [{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]
1 [{periode:20161212-20161217,nb:0}]
Query:
SELECT idx -- index
, my_struct.periode AS periode -- unpacks periode
, my_struct.nb AS nb -- unpacks nb
FROM database.table
LATERAL VIEW EXPLODE(arr_of_structs) exptbl AS my_struct
Output:
idx periode nb
0 20160118-20160205 1
0 20161130-20161130 1
0 20161130-20161221 1
1 20161212-20161217 0
It's a bit unclear from your question what the desired result is, but as soon as you update it I'll modify the query accordingly.
EDIT:
The above solution is incorrect, I didn't catch that your input is a STRING.
Query:
SELECT REGEXP_EXTRACT(tmp_arr[0], "([0-9]{8}-[0-9]{8})") AS periode
, REGEXP_EXTRACT(tmp_arr[1], ":([0-9]*)") AS nb
FROM (
SELECT idx
, pos
, COLLECT_SET(tmp_col) AS tmp_arr
FROM (
SELECT idx
, tmp_col
, CASE WHEN PMOD(pos, 2) = 0 THEN pos+1 ELSE pos END AS pos
FROM (
SELECT *
, ROW_NUMBER() OVER () AS idx
FROM database.table ) x
LATERAL VIEW POSEXPLODE(SPLIT(periode, ',')) exptbl AS pos, tmp_col ) y
GROUP BY idx, pos) z
Output:
periode nb
20160118-20160205 1
20161130-20161130 1
20161130-20161221 1
20161212-20161217 0
What about using the split function? You should be able to do something like:
select nb, period from
(select split(periode, "-") as periods, nb from yourtable) t
LATERAL VIEW explode(periods) sss AS period;
I haven't tried it, but it should work :)
EDIT: the above should work if you have a column periodes following a date-date-date... pattern and a column nb, but it looks like that isn't the case here. The following query should work for you (verbose, but it works):
select period, nb from (
select
regexp_replace(split(split(tok1,",")[1],":")[1], "[\\]|}]", "") as nb,
split(split(split(tok1,",")[0],":")[1],"-") as periods
from
(select split(YOURSTRINGCOLUMN, "},") as s1 from YOURTABLE)
r1 LATERAL VIEW explode(s1) ss1 AS tok1
) r2 LATERAL VIEW explode(periods) ss2 AS period;
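Outside Hive, the same split-then-regexp idea can be checked quickly in plain Python (using the sample string from the question):

```python
import re

raw = "[{periode:20160118-20160205,nb:1},{periode:20161130-20161130,nb:1},{periode:20161130-20161221,nb:1}]"

# Mirror the Hive query: split on "}," to get one struct per token,
# then extract periode and nb with regular expressions.
rows = []
for tok in raw.split("},"):
    periode = re.search(r"[0-9]{8}-[0-9]{8}", tok).group()
    nb = int(re.search(r"nb:([0-9]+)", tok).group(1))
    rows.append((periode, nb))

print(rows)
# [('20160118-20160205', 1), ('20161130-20161130', 1), ('20161130-20161221', 1)]
```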
I realize this question is a year old, but I ran into this same issue and tackled it by using the json_split Brickhouse UDF.
SELECT EXPLODE(
json_split(
'[{"periode":"20160118-20160205","nb":1},{"periode":"20161130-20161130","nb":1},{"periode":"20161130-20161221","nb":1}]'
));
col
{"periode":"20160118-20160205","nb":1}
{"periode":"20161130-20161130","nb":1}
{"periode":"20161130-20161221","nb":1}
Sorry for the spaghetti code.
There's also a similar question here using JSON arrays instead of JSON strings. It's not the same case, but for anyone facing this kind of task it might be useful in a bigger context.

Postgres: Need to select keywords as separate array values

Datatype:
id: int4
keywords: text
objectivable_id: int4
Postgres version: PostgreSQL 9.5.3
Business_objectives table:
id keywords objectivable_id
1 keyword1a,keyword1b,keyword1c 6
2 keyword2a 6
3 testing 5
Currently the query I'm using is :
select array(select b.keywords from business_objectives b where b.objectivable_id = 6)
It selects the keywords of matched objectivable_id as:
{"keyword1a,keyword1b,keyword1c","keyword2a"}
Over here I wanted the result to be :
{"keyword1a","keyword1b","keyword1c","keyword2a"}
I tried using string_agg(text, delimiter), but it just combines all the keywords into a single array element.
You can simply (and cheaply!) use:
SELECT string_to_array(string_agg(keywords, ','), ',')
FROM business_objectives
WHERE objectivable_id = 6;
Concatenate your comma-separated lists with string_agg(), and then convert the complete text to an array with string_to_array().
So something like this can give you expected result:
SELECT array_agg( j.keys )
FROM business_objectives b,
LATERAL ( SELECT k
FROM unnest ( string_to_array( b.keywords, ',' ) ) u( k )
) j( keys )
WHERE b.objectivable_id = 6;
array_agg
-------------------------------------------
{keyword1a,keyword1b,keyword1c,keyword2a}
(1 row)
With the LATERAL part, we reference the outer query to build a derived table: it simply splits your keywords into a set of rows, which you can then feed into the array_agg() function.
See more about LATERAL: https://www.postgresql.org/docs/9.6/static/queries-table-expressions.html#QUERIES-LATERAL
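The join-then-split logic both answers implement can be sketched in plain Python (hypothetical tuples standing in for the business_objectives rows):

```python
rows = [
    (1, "keyword1a,keyword1b,keyword1c", 6),
    (2, "keyword2a", 6),
    (3, "testing", 5),
]

# string_agg + string_to_array, in plain Python: join the keyword
# strings of the matching rows with commas, then split the result once.
joined = ",".join(kw for _id, kw, oid in rows if oid == 6)
result = joined.split(",")
print(result)  # ['keyword1a', 'keyword1b', 'keyword1c', 'keyword2a']
```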

Hive table Array Columns - explode using array_index

Hi, I have a Hive table:
select a,b,c,d from riskfactor_table
In the above table, columns b, c, and d are array columns. Below is my Hive DDL:
Create external table riskfactor_table
(a string,
b array<string>,
c array<double>,
d array<double> )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
stored as textfile location 'user/riskfactor/data';
Here is my table data:
ID400S,["jms","jndi","jaxb","jaxn"],[100,200,300,400],[1,2,3,4]
ID200N,["one","two","three"],[212,352,418],[6,10,8]
How can I split the array columns? With the explode function I can split the array values for only one column:
select explode(b) as b from riskfactor_table;
Output:
jms
jndi
jaxb
jaxn
one
two
three
But I want all the columns to be populated using one SELECT statement, as below:
Query - select a,b,c,d from risk_factor;
Output:
row1- ID400S jms 100 1
row2- ID400S jndi 200 2
row3- ID400S jaxb 300 3
row4- ID400S jaxn 400 4
How can I populate all the data?
You can achieve this using LATERAL VIEW:
SELECT a, myB, myC, myD
FROM riskfactor_table
LATERAL VIEW explode(b) tableB AS myB
LATERAL VIEW explode(c) tableC AS myC
LATERAL VIEW explode(d) tableD AS myD;
Note that chaining LATERAL VIEW explode like this produces the cross product of the arrays, not a position-by-position pairing. For more detail, go through the Hive LATERAL VIEW documentation.
Use the 'numeric_range' UDF from Brickhouse. Here is a blog posting describing the details.
https://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
In your case, your query would be something like
SELECT a,
array_index( b, i ),
array_index( c, i ),
array_index( d, i )
FROM risk_factor_table
LATERAL VIEW numeric_range( 0, 3 );
I was also looking for a solution to the same question. Thanks, Jerome, for this Brickhouse solution.
I had to make a slight change (the addition of the alias "n1 as n"), as below, to make it work for my case:
hive> describe test;
OK
id string
animals array<string>
cnt array<bigint>
hive> select * from test;
OK
abc ["cat","dog","elephant","dolphin","snake","parrot","ant","frog","kuala","cricket"] [10597,2027,1891,1868,1804,1511,1496,1432,1305,1299]
hive> select `id`, array_index(`animals`,n), array_index(`cnt`,n) from test lateral view numeric_range(0,10) n1 as n;
OK
abc cat 10597
abc dog 2027
abc elephant 1891
abc dolphin 1868
abc snake 1804
abc parrot 1511
abc ant 1496
abc frog 1432
abc kuala 1305
abc cricket 1299
The only thing is that I have to know beforehand that there are 10 elements to be exploded.
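The position-by-position pairing that numeric_range and array_index achieve can be sketched in plain Python (using the sample row from the question; this assumes the arrays all have the same length):

```python
row = ("ID400S", ["jms", "jndi", "jaxb", "jaxn"], [100, 200, 300, 400], [1, 2, 3, 4])

# numeric_range + array_index, in plain Python: walk one index over
# the parallel arrays and emit one output row per position.
a, b, c, d = row
exploded = [(a, b[i], c[i], d[i]) for i in range(len(b))]
for r in exploded:
    print(r)  # first row: ('ID400S', 'jms', 100, 1)
```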
