Create UDAF (not UDTF) in Snowflake

Java UDFs return a scalar result. Java UDTFs are not currently supported. (reference)
That said, I created a Java UDF as given below:
CREATE OR REPLACE FUNCTION MAP_COUNT(colValue STRING)
RETURNS OBJECT
LANGUAGE JAVA
HANDLER = 'Frequency.calculate'
TARGET_PATH = '@~/Frequency.jar'
AS
$$
import java.util.HashMap;
import java.util.Map;

class Frequency {
    Map<String, Integer> frequencies = new HashMap<>();

    public Map<String, Integer> calculate(String colValue) {
        frequencies.putIfAbsent(colValue, 0);
        frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
        return frequencies;
    }
}
$$;
Using the MAP_COUNT UDF in a query as below:
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) from temp_1 a;
I get the result below:
|MAP_COUNT(A.MY_COL) |
|-------------------------------|
|{ "John": "1" } |
|{ "John": "2" } |
|{ "John": "2", "doe": "1" } |
|{ "John": "2", "doe": "2"} |
The result I expect from my UDF is as below
|MAP_COUNT(A.MY_COL) |
|-------------------------------|
|{ "John": "2", "doe": "2"} |
Is this possible in Snowflake?
What if I have a query like the one below?
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select MAP_COUNT(a.my_col) as names, MAP_COUNT(a.age) as ages from temp_1 a;
The result I expect from my UDF is as below
|NAMES                       |AGES                    |
|----------------------------|------------------------|
|{ "John": "2", "doe": "2" } |{ "27": "2", "28": "2" }|
There are ways to achieve this by simply restructuring the query, but I want to know whether it is possible to use the MAP_COUNT function directly in the select clause, similar to the OBJECT_AGG function.

When you run a query that uses a UDF, not all rows will necessarily go to the same instance of the UDF. For example, let's say that you're selecting from a table, and you do:
SELECT MyUdf(x) FROM T
Here T may have multiple micro-partitions, and the way that it executes is actually similar to:
SELECT MyUdf(x) FROM T_part1 UNION ALL
SELECT MyUdf(x) FROM T_part2 UNION ALL
SELECT MyUdf(x) FROM T_part3 UNION ALL
SELECT MyUdf(x) FROM T_part4
Here there are four separate instances of MyUdf, and each one sees just a subset of the rows from T as a whole.
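To make this concrete, here is a plain Node.js sketch (illustrative only, not Snowflake code) of the effect: each instance keeps its own map, so each returns only a partial count, and only merging the partials yields the full aggregate.

```javascript
// Each "partition" runs its own UDF instance with its own frequency map,
// so each instance produces only a partial count.
function countFrequencies(rows) {
  const freq = {};
  for (const v of rows) freq[v] = (freq[v] || 0) + 1;
  return freq;
}

// Rows as they might be split across two hypothetical micro-partitions.
const part1 = ['John', 'John'];
const part2 = ['doe', 'doe'];

const partial1 = countFrequencies(part1); // { John: 2 }
const partial2 = countFrequencies(part2); // { doe: 2 }

// Only merging the partial results gives the full aggregate.
const merged = {};
for (const p of [partial1, partial2]) {
  for (const [k, v] of Object.entries(p)) merged[k] = (merged[k] || 0) + v;
}
// merged is { John: 2, doe: 2 }
```

This is exactly why the scalar UDF returns several different partial maps rather than one combined one.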
Going back to your example, you're trying to emulate a user-defined aggregate function, where a particular instance of the UDF sees every row. The way to guarantee this would be to aggregate in advance, e.g.:
CREATE OR REPLACE FUNCTION MAP_COUNT(colValues ARRAY)
RETURNS OBJECT
LANGUAGE JAVA
HANDLER = 'Frequency.calculate'
TARGET_PATH = '@~/Frequency.jar'
AS
$$
import java.util.HashMap;
import java.util.Map;

class Frequency {
    public Map<String, Integer> calculate(String[] colValues) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String colValue : colValues) {
            frequencies.putIfAbsent(colValue, 0);
            frequencies.computeIfPresent(colValue, (key, value) -> value + 1);
        }
        return frequencies;
    }
}
$$;
(Note that I changed the UDF and method signatures to use array and String[], respectively.) Now use it in a query:
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select
MAP_COUNT(ARRAY_AGG(a.my_col)) as names,
MAP_COUNT(ARRAY_AGG(a.age)) as ages
from temp_1 a;
This gives me:
names ages
{ "John": "2", "doe": "2" } { "27": "2", "28": "2" }
There are still two problems here, notably:
This doesn't scale very well. If the size of either array exceeds 16MB (the maximum value size), the query will fail.
The syntax is clunky. Ideally, you'd just use the UDF like any other aggregate function rather than having to wrap your inputs in ARRAY_AGG.
The good news is that both of these problems will be addressed once Java UDAFs are available at some point in the future.

You can achieve much of this desired functionality using JavaScript UDTFs today. In particular, a UDTF can be configured to return only an (aggregated) result at the end of a "grouping". Here is an example:
CREATE OR REPLACE FUNCTION MAP_COUNT(COLVALUE varchar)
RETURNS TABLE (FREQUENCIES variant)
LANGUAGE JAVASCRIPT
AS '
{
    initialize: function (argumentInfo, context) {
        this.freq = {};
    },
    processRow: function (row, rowWriter, context) {
        var freqVal = this.freq[row.COLVALUE];
        this.freq[row.COLVALUE] = (freqVal === undefined ? 1 : 1 + freqVal);
    },
    finalize: function (rowWriter, context) {
        rowWriter.writeRow({FREQUENCIES: this.freq});
    }
}
';
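As a sanity check, the handler's accumulation logic can be exercised outside Snowflake by calling its callbacks directly in plain Node.js; the rowWriter stub below is an assumption standing in for the object Snowflake supplies.

```javascript
// The same handler object, tested outside Snowflake.
const handler = {
  initialize(argumentInfo, context) { this.freq = {}; },
  processRow(row, rowWriter, context) {
    const prev = this.freq[row.COLVALUE];
    this.freq[row.COLVALUE] = (prev === undefined ? 1 : prev + 1);
  },
  finalize(rowWriter, context) {
    rowWriter.writeRow({ FREQUENCIES: this.freq });
  },
};

// Hypothetical stub: collects rows that finalize() emits.
const written = [];
const rowWriter = { writeRow: (r) => written.push(r) };

handler.initialize();
['John', 'John', 'doe', 'doe'].forEach(v =>
  handler.processRow({ COLVALUE: v }, rowWriter));
handler.finalize(rowWriter);
// written[0].FREQUENCIES is { John: 2, doe: 2 }
```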
Create a temp table to test this on:
create or replace temporary table mytemp as
with temp_1 as
(
SELECT 'John' AS my_col, 27 as age
UNION ALL
SELECT 'John' AS my_col, 28 as age
UNION ALL
SELECT 'doe' AS my_col, 27 as age
UNION ALL
SELECT 'doe' AS my_col, 28 as age
)
select * from temp_1;
Run the UDTF as a single partition result:
select agg.* from myTemp,
table(map_count(my_col)) agg
;
Result: { "John": 2, "doe": 2 }
Run the UDTF partitioned by separate groupings:
select my_col, agg.* from myTemp,
table(map_count(age::varchar) over (partition by my_col)) agg
;
Result:
doe { "27": 1, "28": 1 }
John { "27": 1, "28": 1 }

Renaming a JSON column for a UNION

Remark: my example is overly simplified. In reality, I am dealing with a huge query. But to illustrate the issue/errors, let us resort to apples and oranges.
My original query looked like this:
SELECT 'FruitsCount' AS "Type", (SELECT count(id) as Counter, [Name] FROM Fruits group by name FOR JSON PATH) AS "Value"
This results in something like the following. Let's refer to this as Format A:
|---------------------|------------------------------------------------------------------------------|
| Type | Value |
|---------------------|------------------------------------------------------------------------------|
| FruitCount | [{"Counter":2, "Name":"Apple"},{"Counter":3, "Name":"Orange"}] |
|---------------------|------------------------------------------------------------------------------|
However, now I want to create a union of Fruit and Vegetable counts. My query now looks like this:
(SELECT count(id) as Counter, [Name] FROM Fruits group by name
UNION
SELECT count(id) as Counter, [Name] FROM Vegetables group by name)
FOR JSON PATH
|-----------------------------------------------------------------------------------------------|
| JSON_F52E2B61-18A1-11d1-B105-00805F49916B                                                     |
|-----------------------------------------------------------------------------------------------|
| [{"Counter":2, "Name":"Apple"},{"Counter":3, "Name":"Orange"},{"Counter":7, "Name":"Tomato"}] |
|-----------------------------------------------------------------------------------------------|
However, I want it in the format as before, where I have a Type and Value columns (Format A).
I tried doing the following:
SELECT 'FruitsCount' AS "Type", ((SELECT count(id) as Counter, [Name] FROM Fruits group by name
UNION
SELECT count(id) as Counter, [Name] FROM Vegetables group by name) FOR JSON PATH) as "Value"
However, I am presented with Error 156: Incorrect syntax near the keyword 'FOR'.
Then I tried the following:
SELECT 'FruitsAndVegCount' AS "Type", (SELECT count(id) as Counter, [Name] FROM Fruits group by name
UNION
SELECT count(id) as Counter, [Name] FROM Vegetables group by name FOR JSON PATH) as "Value"
However, I am presented with Error 1086: The FOR XML and FOR JSON clauses are invalid in views, inline functions, derived tables, and subqueries when they contain a set operator.
I'm stuck in trying to get my "union-ized" query to be in Format A.
Update 1: Here is the desired output
|---------------------|------------------------------------------------------------------------------------------------|
| Type | Value |
|---------------------|------------------------------------------------------------------------------------------------|
| FruitAndVegCount | [{"Counter":2, "Name":"Apple"},{"Counter":3, "Name":"Orange"},{"Counter":7, "Name":"Tomato"}] |
|---------------------|------------------------------------------------------------------------------------------------|
The goal is to only have a single row, with 2 columns (Type, Value) where Type is whatever I specify (i.e. FruitAndVegCount) and Value is a JSON of the ResultSet that is created by the union query.
If I understand the question correctly, the following statement is an option:
SELECT
[Type] = 'FruitAndVegCount',
[Value] = (
SELECT Counter, Name
FROM (
SELECT count(id) as Counter, [Name] FROM Fruits group by name
UNION ALL
SELECT count(id) as Counter, [Name] FROM Vegetables group by name
) t
FOR JSON PATH
)
You could do it with two columns, Type and Value, as follows:
select 'FruitAndVegCount' as [Type],
(select [Counter], [Name]
from (select count(id) as Counter, [Name] from #Fruits group by [name]
union all
select count(id) as Counter, [Name] from #Vegetables group by [name]) u
for json path) [Value];
Output
Type Value
FruitAndVegCount [{"Counter":2,"Name":"apple"},{"Counter":1,"Name":"pear"},{"Counter":2,"Name":"carrot"},{"Counter":1,"Name":"kale"},{"Counter":2,"Name":"lettuce"}]

Flatten and aggregate two columns of arrays via distinct in Snowflake

Table structure is
+------------+---------+
| Animals | Herbs |
+------------+---------+
| [Cat, Dog] | [Basil] |
| [Dog, Lion]| [] |
+------------+---------+
Desired output (don't care about sorting of this list):
+-------------------------+
| unique_things           |
+-------------------------+
| [Cat, Dog, Lion, Basil] |
+-------------------------+
First attempt was something like
SELECT ARRAY_CAT(ARRAY_AGG(DISTINCT(animals)), ARRAY_AGG(herbs))
But this produces
[[Cat, Dog], [Dog, Lion], [Basil], []]
since the DISTINCT operates on each array value as a whole, rather than on the individual elements across all arrays.
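In other words, what I'm after is flatten-first-then-dedupe, which in plain JavaScript terms (illustrative only) would be:

```javascript
// Flatten both array columns first, then deduplicate the elements,
// mirroring FLATTEN followed by ARRAY_AGG(DISTINCT ...).
const rows = [
  { animals: ['Cat', 'Dog'], herbs: ['Basil'] },
  { animals: ['Dog', 'Lion'], herbs: [] },
];
const uniqueThings = [...new Set(rows.flatMap(r => [...r.animals, ...r.herbs]))];
// uniqueThings is ['Cat', 'Dog', 'Basil', 'Lion']
```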
If I understand your requirements right, and assuming source data of
insert into tabarray select array_construct('cat', 'dog'), array_construct('basil');
insert into tabarray select array_construct('lion', 'dog'), null;
the query would look like this:
select array_agg(distinct value) from
(
  select value from tabarray, lateral flatten( input => col1 )
  union all
  select value from tabarray, lateral flatten( input => col2 )
);
UPDATE
It is possible without using FLATTEN, by using ARRAY_UNION_AGG:
Returns an ARRAY that contains the union of the distinct values from the input ARRAYs in a column.
For sample data:
CREATE OR REPLACE TABLE t AS
SELECT ['Cat', 'Dog'] AS Animals, ['Basil'] AS Herbs
UNION SELECT ['Dog', 'Lion'], [];
Query:
SELECT ARRAY_UNION_AGG(ARRAY_CAT(Animals, Herbs)) AS Result
FROM t
or:
SELECT ARRAY_UNION_AGG(Animals) AS Result
FROM (SELECT Animals FROM t
UNION ALL
SELECT Herbs FROM t);
You could flatten the combined array and then aggregate back:
SELECT ARRAY_AGG(DISTINCT F."VALUE") AS unique_things
FROM tab, TABLE(FLATTEN(ARRAY_CAT(tab.Animals, tab.Herbs))) f
Here is another variation that also handles NULLs in case they appear in the data set (using ARRAY_CAT with COALESCE so a NULL column doesn't wipe out the row, and ARRAY_COMPACT to drop NULL elements):
SELECT ARRAY_AGG(DISTINCT a.VALUE) AS unique_things
FROM tab, TABLE(FLATTEN(ARRAY_COMPACT(ARRAY_CAT(tab.Animals, COALESCE(tab.Herbs, ARRAY_CONSTRUCT()))))) a

Returning Field names as part of a SQL Query

I need to write a SQL statement that gets passed any valid SQL subquery and returns the result set, WITH HEADERS.
Somehow I need to interrogate the result set, get the field names, and return them as part of a "union" with the original data, then pass the result onwards for exporting.
Below is my attempt: I have a subquery called "A", which returns a dataset, and I need to query it for its field names (ordinally, maybe?).
select A.fields[0].name, A.fields[1].name, A.fields[2].name from
(
Select 'xxx1' as [Complaint Mechanism] , 'xxx2' as [Actual Achievements]
union ALL
Select 'xxx3' as [Complaint Mechanism] , 'xxx4' as [Actual Achievements]
union ALL
Select 'xxx5' as [Complaint Mechanism] , 'xxx6' as [Actual Achievements] ) as A
Any pointers would be appreciated (maybe I am just missing the obvious...).
The Resultset should look like the table below:
F1 F2
--------------------- ---------------------
[Complaint Mechanism] [Actual Achievements]
xxx1 xxx2
xxx3 xxx4
xxx5 xxx6
If you have a static number of columns, you can put your data into a temp table and then query tempdb.sys.columns to get the column names, which you can then union on top of your data. If you have a dynamic number of columns, you will need dynamic SQL to build the pivot statement, but I'll leave that up to you to figure out.
The one caveat here is that all data under your column names will need to be converted to strings:
select 1 a, 2 b
into #a;
select [1] as FirstColumn
,[2] as SecondColumn
from (
select column_id
,name
from tempdb.sys.columns
where object_id = object_id('tempdb..#a')
) d
pivot (max(name)
for column_id in([1],[2])
) pvt
union all
select cast(a as nvarchar(100))
,cast(b as nvarchar(100))
from #a;
Query Results:
| FirstColumn | SecondColumn |
|-------------|--------------|
| a | b |
| 1 | 2 |

Segregation Data in SQL

I have a composite Id in my database table, like 00-000-000.
All names are stored in one field in the same table:
01-000-000=Warehouse
01-001-000-=Rack
01-001-001=Bin cart
I want to segregate the data into 3 different fields. Is this possible in SQL?
There is another method, using the PARSENAME function:
select PARSENAME(replace(left(FieldName,10),'-','.'),3) col1,
PARSENAME(replace(left(FieldName,10),'-','.'),2) col2,
PARSENAME(replace(left(FieldName,10),'-','.'),1) col3 from yourTable
As we don't have quite enough information I am assuming that you wish to split the 3 number identifier at the start of each row into the 3 separate numbers...
If the Ids are always of fixed length (i.e. 2 characters, then a dash, then 3 characters, then a dash, then 3 more characters), you can use the SUBSTRING function to break them out:
WITH TestData as (
SELECT '01-000-000=Warehouse' AS Id
UNION
SELECT '01-001-000-=Rack' AS Id
UNION
SELECT '01-001-001=Bin cart' AS Id
)
SELECT
Id,
substring(Id, 1, 2) AS FirstId,
substring(Id, 4, 3) AS SecondId,
substring(Id, 8, 3) AS ThirdId,
substring(Id, 11, len(id) - 10) AS RestOfString
FROM TestData
If they are variable lengths you will have to use something like the CHARINDEX function to find the positions of the dashes, and then split on them.
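The variable-length case really just amounts to splitting on the dashes; here is the same idea as a plain JavaScript sketch (splitId is a hypothetical helper, not part of any SQL solution here):

```javascript
// Split an id of the form "xx-yyy-zzz=Name" into its three parts plus the name.
function splitId(s) {
  const eq = s.indexOf('=');
  const idPart = eq === -1 ? s : s.slice(0, eq);
  const name = eq === -1 ? '' : s.slice(eq + 1);
  const [col1, col2, col3] = idPart.split('-');
  return { col1, col2, col3, name };
}

const r = splitId('01-000-000=Warehouse');
// r is { col1: '01', col2: '000', col3: '000', name: 'Warehouse' }
```

In T-SQL the equivalent dash positions come from CHARINDEX, as the last answer below shows.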
If it's always going to be the same length then you can do some simple code using LEFT and RIGHT
Test Data;
IF OBJECT_ID('tempdb..#TestData') IS NOT NULL DROP TABLE #TestData
GO
CREATE TABLE #TestData (FieldName varchar(50))
INSERT INTO #TestData (FieldName)
VALUES
('01-000-000=Warehouse')
,('01-001-000-=Rack')
,('01-001-001=Bin cart')
Query;
SELECT
FieldName
,LEFT(FieldName,2) Result1
,RIGHT(LEFT(FieldName,6),3) Result2
,RIGHT(LEFT(FieldName,10),3) Result3
FROM #TestData
Result;
FieldName Result1 Result2 Result3
01-000-000=Warehouse 01 000 000
01-001-000-=Rack 01 001 000
01-001-001=Bin cart 01 001 001
Use this for variable-length values with two "-" signs:
SELECT SUBSTRING('001-0011-0010',1,CHARINDEX('-','001-0011-0010')-1) COLA,
SUBSTRING ('001-0011-0010',
CHARINDEX('-','001-0011-0010')+1,
CHARINDEX('-','001-0011-0010',
CHARINDEX('-','001-0011-0010')+1)-(CHARINDEX('-','001-0011-0010')+1)
) COLB,
SUBSTRING ('001-0011-0010',(CHARINDEX('-','001-0011-0010',
CHARINDEX('-','001-0011-0010')+1))+1, LEN('001-0011-0010')
) COLC

Using union to do a crosstab query

I have a table which has the following structure :
id key data
1 A 10
1 B 20
1 C 30
I need to write a query so that I get these keys as columns and their values as rows.
Eg :
id A B C
1 10 20 30
I have tried using UNION and CASE, but I get 3 rows instead of one.
Any suggestions?
The most straightforward way to do this is:
SELECT DISTINCT t."id",
       (SELECT "data" FROM Table1 WHERE "key" = 'A' AND "id" = t."id") AS "A",
       (SELECT "data" FROM Table1 WHERE "key" = 'B' AND "id" = t."id") AS "B",
       (SELECT "data" FROM Table1 WHERE "key" = 'C' AND "id" = t."id") AS "C"
FROM Table1 t
Or you can use a PIVOT:
SELECT * FROM
(SELECT "id", "key", "data" FROM Table1)
PIVOT (
MAX("data")
FOR ("key") IN ('A', 'B', 'C'));
sqlfiddle demo
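For reference, the pivot itself is easy to express procedurally; here is a plain JavaScript sketch (illustrative only) of collapsing the key/data rows into one object per id:

```javascript
// Pivot rows of { id, key, data } into one object per id.
const rows = [
  { id: 1, key: 'A', data: 10 },
  { id: 1, key: 'B', data: 20 },
  { id: 1, key: 'C', data: 30 },
];
const byId = {};
for (const { id, key, data } of rows) {
  if (!byId[id]) byId[id] = { id };
  byId[id][key] = data;   // each key becomes a column on the id's row
}
const pivoted = Object.values(byId);
// pivoted is [{ id: 1, A: 10, B: 20, C: 30 }]
```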
