Array intersect Hive - arrays

I have two arrays of strings in Hive, like
{'value1','value2','value3'}
{'value1', 'value2'}
I want to merge the arrays without duplicates. Expected result:
{'value1','value2','value3'}
How can I do this in Hive?

A native solution could be:
SELECT id, collect_set(item)
FROM table
LATERAL VIEW explode(list) lTable AS item
GROUP BY id;
First explode the array with LATERAL VIEW, then group by id and remove duplicates with collect_set.
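As a rough illustration of what the query above does, here is a minimal Python sketch that mimics explode and collect_set on in-memory rows. The sample rows and the helper names are assumptions made for this example, not Hive APIs. It also assumes the two arrays are stored as separate rows that share one id.

```python
from collections import defaultdict

# Hypothetical rows of (id, list); both arrays share the same id.
rows = [
    (1, ["value1", "value2", "value3"]),
    (1, ["value1", "value2"]),
]


def explode(rows):
    """Mimic LATERAL VIEW explode(list): one (id, item) pair per array element."""
    for row_id, items in rows:
        for item in items:
            yield row_id, item


def collect_set_by_id(pairs):
    """Mimic GROUP BY id with collect_set(item): keep distinct items per id."""
    groups = defaultdict(list)
    for row_id, item in pairs:
        if item not in groups[row_id]:
            groups[row_id].append(item)
    return dict(groups)


merged = collect_set_by_id(explode(rows))
print(merged)
```

This sketch keeps first-seen order, whereas Hive's collect_set makes no ordering promise, so the element order of the real result may differ.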

You will need a UDF for this. Klout has published a set of open-source Hive UDFs under the package
brickhouse, available on GitHub. It includes UDFs that serve exactly this purpose.
Download and build the project, then add the JAR. Here is an example:
CREATE TEMPORARY FUNCTION combine AS 'brickhouse.udf.collect.CombineUDF';
CREATE TEMPORARY FUNCTION combine_unique AS 'brickhouse.udf.collect.CombineUniqueUDAF';
select combine_unique(combine(array('a','b','c'), array('b','c','d'))) from reqtable;
OK
["d","b","c","a"]

Related

ODI 12c CUSTOM_TEMPLATE extract option not working

I'm using the CUSTOM_TEMPLATE extract option on the source table to force the select to actually read from another table. A custom IKM I'm using should then pick this up in order to get the column list of the "forced" table through the odiRef.getColList API. But the template select query is not considered at all during execution, so the IKM still gets the columns from the original table, which I don't need.
The code in the CUSTOM_TEMPLATE is:
select *
from <%=odiRef.getObjectName("L", "#V_OFFL_TABLE_NAME", "OFFLOAD_AREA_HIST", "DWH_LCL", "D") %>
where src_date_from_dt = to_date('V_OFFL_TRANSFER_DATE','YYYY-MM-DD')
The code in the SOURCE tab of the custom IKM I made is:
select <%=odiRef.getSrcColList("","[COL_NAME]",",\n","")%>
from <%=odiRef.getObjectName("L", "#V_OFFL_TABLE_NAME", "OFFLOAD_AREA_HIST", "DWH_LCL", "D") %>
where src_date_from_dt = to_date('V_OFFL_TRANSFER_DATE','YYYY-MM-DD')
In this case I'm trying odiRef.getSrcColList in the IKM, but I've also tried odiRef.getColList, with the same result.
Try creating a dummy datastore in the model that contains that table.
You can add dummy or general-purpose attributes to that datastore (Var1, Var2, Var3, ... Num1, Num2, etc.).
Drag this datastore to the source area and add the CUSTOM_TEMPLATE there.
Make sure this dummy datastore has the same logical schema as the table in your custom query.
This can also work with a multi-table query.

flink-sql: how to check if an array contains a given element?

I am trying to check whether a given string is present in an array of strings.
Is there Flink SQL support for this operation?
In other words, I am looking for the Flink SQL equivalent of Presto SQL's contains() array function (prestodb docs).
Maybe with UNNEST? In this example from the Flink docs, tags is an array:
SELECT users, tag
FROM Orders CROSS JOIN UNNEST(tags) AS t (tag)

Groupby and count() with alias and 'normal' dataframe: python pandas versus mssql

Coming from a SQL environment, I am learning some things in Python Pandas. I have a question regarding grouping and aggregates.
Say I group a dataset by Age Category and count the different categories. In MSSQL I would write this:
SELECT AgeCategory, COUNT(*) AS Cnt
FROM TableA
GROUP BY AgeCategory
ORDER BY 1
The result set is a 'normal' table with two columns; I named the second column Cnt.
When I want to do the equivalent in Pandas, the groupby result has a different format, so I have to reset the index and rename the column on a following line. My code would look like this:
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index()
grouped.columns = ['AgeCategory', 'Count']
grouped
My question is whether this can be accomplished in one go. It seems like I am overdoing it, but I lack experience.
Thanks for any advice.
Regards, M.
Use the name parameter of Series.reset_index (the grouped count is a Series):
grouped = df.groupby('AgeCategory')['ColA'].count().reset_index(name='Count')
Or:
grouped = df.groupby('AgeCategory').size().reset_index(name='Count')
The difference is that GroupBy.count excludes missing values, while GroupBy.size does not.
More information about aggregation in pandas.
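To make the count versus size difference concrete, here is a small sketch on made-up data. The DataFrame contents are an assumption for illustration; only the column names AgeCategory and ColA come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ColA has one missing value in each category.
df = pd.DataFrame({
    "AgeCategory": ["A", "A", "B", "B", "B"],
    "ColA": [1.0, np.nan, 2.0, 3.0, np.nan],
})

# count() skips NaN in ColA; name= labels the value column in one step.
by_count = df.groupby("AgeCategory")["ColA"].count().reset_index(name="Count")

# size() counts every row in the group, like COUNT(*) in SQL.
by_size = df.groupby("AgeCategory").size().reset_index(name="Count")

print(by_count)
print(by_size)
```

With this data, by_count reports 1 and 2 rows for A and B, while by_size reports 2 and 3, so size() is the closer match to the COUNT(*) in the original query.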

List of strings to attribute list

I have a report which was built on MDX-query:
SELECT {[Measures].[IssueOpened] } ON COLUMNS,
{( STRTOSET("[Assigned To].[Id].[Some],[Assigned To].[Id].[Another]") *
[Priorities].[Id].[Id].ALLMEMBERS ) } ON ROWS
FROM (SELECT (STRTOSET(#createdOn) ) ON COLUMNS
FROM [Reports])
I want to replace the static string "[Assigned To].[Id].[Some],[Assigned To].[Id].[Another]" with a parameter:
SELECT {[Measures].[IssueOpened] } ON COLUMNS,
{( STRTOSET(#assignedTo) *
[Priorities].[Id].[Id].ALLMEMBERS ) } ON ROWS
FROM (SELECT (STRTOSET(#createdOn) ) ON COLUMNS
FROM [Reports])
I have created the parameter, but the Available values for this parameter come from a relational dataset (not an MDX dimension). Allow multiple values is set to Yes.
How can I convert the value of the parameter to a list of attributes: "[Assigned To].[Id].[Some],[Assigned To].[Id].[Another]"?
One way would be to create a CLR stored procedure for Analysis Services that builds the SET for you. You can find some examples on Google (e.g. http://andrewdenhertog.com/analysis-services/clr-stored-procedures-in-sql-server-analysis-services-ssas/).
If these come from a relational data source, I just encode them in the format that MDX expects for the parameter's Value property, for example:
Parameter Label: Some
Parameter Value: [Assigned To].[Id].[Some]
Sometimes this turns out to be easy to create in T-SQL. Other times you need to do a little hacking with expressions, for instance if you need to support dynamic hierarchies; role-playing dimensions would be an example. The basic concept is similar, though.
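The encoding described above can be sketched outside the report as plain string building. This is only an illustration in Python: the helper name to_mdx_set and its hierarchy argument are hypothetical, and in a real report the same string would be produced in the dataset query or a report expression. It assumes the usual MDX rule of doubling a closing bracket inside a delimited name.

```python
def to_mdx_set(labels, hierarchy="[Assigned To].[Id]"):
    """Turn plain labels into the comma-separated member list STRTOSET receives.

    Hypothetical helper; the default hierarchy is taken from the question.
    """
    members = []
    for label in labels:
        # Assumption: a closing bracket inside a member name is escaped by doubling it.
        escaped = label.replace("]", "]]")
        members.append(f"{hierarchy}.[{escaped}]")
    return ",".join(members)


print(to_mdx_set(["Some", "Another"]))
```

The printed string has the same shape as the static string in the original query, so it could be passed where the multi-value parameter is expected.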

Equivalent of "IN" that uses AND instead of OR logic?

I know I'm breaking some rules here with dynamic SQL but I still need to ask. I've inherited a table that contains a series of tags for each ticket that I need to pull records from.
Simple example... I have an array that contains "'Apples','Oranges','Grapes'" and I am trying to retrieve all records that contain ALL items contained within the array.
My SQL looks like this:
SELECT * FROM table WHERE basket IN ( " + fruitArray + " )
Which of course would be the equivalent of:
SELECT * FROM table WHERE basket = 'Apples' OR basket = 'Oranges' OR basket = 'Grapes'
I'm curious if there is a function that works the same as IN ( array ) except that it uses AND instead of OR so that I can obtain the same results as:
SELECT * FROM table WHERE basket LIKE '%Apples%' AND basket LIKE '%Oranges%' AND basket LIKE '%Grapes%'
I could probably just generate the entire string manually, but would like a more elegant solution if at all possible. Any help would be appreciated.
This is a very common problem in SQL. There are basically two solutions:
Match all rows in your list, group by a column that has a common value on all those rows, and make sure the count of distinct values in the group is the number of elements in your array.
SELECT basket_id FROM baskets
WHERE basket IN ('Apples','Oranges','Grapes')
GROUP BY basket_id
HAVING COUNT(DISTINCT basket) = 3
Do a self-join for each distinct value in your array; only then can you compare values from multiple rows in one WHERE expression.
SELECT b1.basket_id
FROM baskets b1
INNER JOIN baskets b2 USING (basket_id)
INNER JOIN baskets b3 USING (basket_id)
WHERE (b1.basket, b2.basket, b3.basket) = ('Apples','Oranges','Grapes')
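The first of the two approaches above can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: the sample rows are invented, and the baskets table is taken to hold one row per (basket_id, item) pair, as in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (basket_id INTEGER, basket TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [
        (1, "Apples"), (1, "Oranges"), (1, "Grapes"),                # has all three
        (2, "Apples"), (2, "Oranges"),                               # missing Grapes
        (3, "Apples"), (3, "Grapes"), (3, "Oranges"), (3, "Pears"),  # all three plus an extra
    ],
)

wanted = ["Apples", "Oranges", "Grapes"]
# One bound placeholder per item, so values are never spliced into the SQL text.
placeholders = ", ".join("?" for _ in wanted)
query = f"""
    SELECT basket_id FROM baskets
    WHERE basket IN ({placeholders})
    GROUP BY basket_id
    HAVING COUNT(DISTINCT basket) = ?
    ORDER BY basket_id
"""
matching = [row[0] for row in conn.execute(query, (*wanted, len(set(wanted))))]
print(matching)
```

Binding the values and passing the number of distinct items as a parameter means the same query text works for a list of any length.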
There may be something like that in full-text search, but in general I sincerely doubt such an operator would be very useful, except in combination with LIKE.
Consider:
SELECT * FROM table WHERE basket ='Apples' AND basket = 'Oranges'
It would always match zero rows.
If basket is a string, like your example suggests, then the closest you could get would be to use LIKE '%apples%oranges%grapes%', which could be built easily with '%'.implode('%', $tags).'%'
The issue with this is if some of 'tags' might be contained in other tags, e.g. 'supercalifragilisticexpialidocious' LIKE '%super%' will be true.
If you need to do LIKE comparisons, I think you're out of luck. If you are doing exact comparisons involving matching sets in arbitrary order, you should look into the INTERSECT and EXCEPT options of the SELECT statement. They're a bit confusing, but can be quite powerful. (You'd have to parse your delimited strings into tabular format, but of course you're doing that anyway, aren't you?)
Are the items you're searching for always in the same order within the basket? If yes, a single LIKE should suffice:
SELECT * FROM table WHERE basket LIKE '%Apples%Oranges%Grapes%';
And concatenating your array into a string with % separators should be trivial.
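A minimal sketch of that concatenation, checked against SQLite through Python's sqlite3 module. The tickets table, its rows, and the helper name like_pattern are assumptions for illustration. Tags that contain % or _ would act as wildcards and would need escaping, which this sketch leaves out.

```python
import sqlite3


def like_pattern(tags):
    """Join tags with % so LIKE requires them to appear in this order."""
    return "%" + "%".join(tags) + "%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, basket TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [
        (1, "Apples,Oranges,Grapes"),   # same order as the pattern
        (2, "Grapes,Oranges,Apples"),   # all three tags, different order
        (3, "Apples,Grapes"),           # one tag missing
    ],
)

pattern = like_pattern(["Apples", "Oranges", "Grapes"])
ordered_hits = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM tickets WHERE basket LIKE ? ORDER BY id", (pattern,)
    )
]
print(pattern, ordered_hits)
```

Row 2 contains every tag but is not matched, which shows why the single-LIKE approach only fits data where the order inside basket is fixed.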

Resources