How to add fields dynamically to snowflake's object_construct function

I have a large table in Snowflake that contains many fields whose names start with the prefix 'RAW_'. To make the table more manageable, I want to condense all of these 'RAW_' fields into a single field called 'RAW_KEY_VALUE' that stores them as a key-value object.
Snowflake's OBJECT_CONSTRUCT function initially looked like the perfect solution. The issue is that it requires hard-coding the fields you wish to convert into the key-value object, which is problematic because I have anywhere from 90 to 120 such fields. Additionally, the set of 'RAW_'-prefixed fields changes all the time, so I need a solution that picks up these fields dynamically and converts them to a key-value store. (I haven't tried creating a stored procedure for this yet but will if all else fails.)
Here is a snippet of the data in question
create or replace table reviews(name varchar(50), acting_rating int, raw_comments varchar(50), raw_rating int, raw_co varchar(50));
insert into reviews values
('abc', 4, NULL, 1, 'NO'),
('xyz', 3, 'some', 1, 'haha'),
('lmn', 1, 'what', 4, NULL);
Below is the output I'm trying to achieve (using the manual input/hard coding approach with object_construct)
select
  name,
  acting_rating,
  object_construct_keep_null('raw_comments', raw_comments, 'raw_rating', raw_rating, 'raw_co', raw_co) as RAW_KEY_VALUE
from reviews;
The above produces this desired output below.
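The expected result (reconstructed from the sample data above; key order inside the objects may differ) is roughly:
NAME | ACTING_RATING | RAW_KEY_VALUE
abc  | 4             | { "raw_co": "NO",   "raw_comments": null,   "raw_rating": 1 }
xyz  | 3             | { "raw_co": "haha", "raw_comments": "some", "raw_rating": 1 }
lmn  | 1             | { "raw_co": null,   "raw_comments": "what", "raw_rating": 4 }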
Please let me know if there are any other ways to approach this. I think if I could work out a way to add the relevant fields to the object_construct function dynamically, that would solve my problem.

You can do this with a JS UDF and object_construct(*):
create or replace function obj_with_prefix(PREFIX string, A variant)
returns variant
language javascript
as $$
// keep only the keys of A that start with the given prefix
let result = {};
for (const key in A) {
  if (key.startsWith(PREFIX)) {
    result[key] = A[key];
  }
}
return result;
$$
;
Test:
with data(aa_1, aa_2, bb_1, aa_3) as (
select 1,2,3,4
)
select obj_with_prefix('AA', object_construct(*))
from data
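Applied to the reviews table from the question, the same pattern would look roughly like this (a sketch: object_construct(*) drops NULL values, so object_construct_keep_null(*) is used here to keep them, and Snowflake upper-cases unquoted column names, which is why the prefix is passed as 'RAW_'; the UDF then filters out the non-RAW_ keys such as NAME):
select
  name,
  acting_rating,
  obj_with_prefix('RAW_', object_construct_keep_null(*)) as RAW_KEY_VALUE
from reviews;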

Related

Creating index on specific JSON value inside an object array

So let's say I have a varchar column in a table with some structure like:
{
"Response":{
"DataArray":[
{
"Type":"Address",
"Value":"123 Fake St"
},
{
"Type":"Name",
"Value":"John Doe"
}
]
}
}
And I want to create a persisted computed column on the "Value" field of the "DataArray" array element whose Type field equals "Name". (I hope I explained that properly; basically I want to index the people's names in that structure.)
The problem is that, unlike with other JSON objects, I can't use the JSON_VALUE function in a straightforward way to extract that value. I have no idea whether this can be done; I've been dabbling with JSON_QUERY, but so far without success.
Any ideas and help appreciated. Thanks!
You could achieve it using a function:
CREATE FUNCTION dbo.my_func(@s NVARCHAR(MAX))
RETURNS NVARCHAR(100)
WITH SCHEMABINDING
AS
BEGIN
DECLARE @r NVARCHAR(100);
SELECT @r = [Value]
FROM OPENJSON(@s,'$.Response.DataArray')
WITH ([Type] NVARCHAR(100) '$.Type', [Value] NVARCHAR(100) '$.Value')
WHERE [Type] = 'Name';
RETURN @r;
END;
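As a quick sanity check before wiring the function into a table, you can call it directly on a minimal document (hypothetical snippet):
SELECT dbo.my_func(N'{"Response":{"DataArray":[{"Type":"Address","Value":"123 Fake St"},{"Type":"Name","Value":"John Doe"}]}}');
-- returns: John Doe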
Defining table:
CREATE TABLE tab(
val NVARCHAR(MAX) CHECK (ISJSON(val) = 1),
col1 AS dbo.my_func(val) PERSISTED -- calculated column
);
Sample data:
INSERT INTO tab(val) VALUES (N'{
"Response":{
"DataArray":[
{
"Type":"Address",
"Value":"123 Fake St"
},
{
"Type":"Name",
"Value":"John Doe"
}
]
}
}');
CREATE INDEX idx ON tab(col1); -- creating index on calculated column
SELECT * FROM tab;
db<>fiddle demo
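With the persisted column indexed, a filter on the extracted name can use an index seek instead of parsing the JSON per row, e.g. (a minimal sketch against the table above):
SELECT val
FROM tab
WHERE col1 = N'John Doe';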
You could use a computed column with PATINDEX and index that:
CREATE TABLE foo (
  a varchar(4000),
  a_ax AS (IIF(PATINDEX('%bar%', a) > 0, SUBSTRING(a, PATINDEX('%bar%', a), 42), ''))
);
CREATE INDEX foo_x ON foo(a_ax);
You could use a scalar function as @Lukasz Szozda posted - it's a good solution for this.
The problem, however, with T-SQL scalar UDFs in computed columns is that they destroy the performance of any query that table is involved in. Not only does data modification (inserts, updates, deletes) slow down, any execution plans for queries that involve that table cannot leverage a parallel execution plan. This is the case even when the computed column is not referenced in the query. Even index builds lose the ability to leverage a parallel execution plan. Note this article: Another reason why scalar functions in computed columns is a bad idea by Erik Darling.
This is not as pretty but, if performance is important, then this will get you the results you need without the drawbacks of a scalar UDF.
CREATE TABLE dbo.jsonStrings
(
jsonString VARCHAR(8000) NOT NULL,
nameTxt AS (
SUBSTRING(
SUBSTRING(jsonString,
CHARINDEX('"Value":"',jsonString,
CHARINDEX('"Type":"Name",',jsonString,
CHARINDEX('"DataArray":[',jsonString)+12))+9,8000),1,
CHARINDEX('"',
SUBSTRING(jsonString,
CHARINDEX('"Value":"',jsonString,
CHARINDEX('"Type":"Name",',jsonString,
CHARINDEX('"DataArray":[',jsonString)+12))+9,8000))-1)) PERSISTED
);
INSERT dbo.jsonStrings(jsonString)
VALUES
('{
"Response":{
"DataArray":[
{
"Type":"Address",
"Value":"123 Fake St"
},
{
"Type":"Name",
"Value":"John Doe"
}
]
}
}');
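A quick verification query against the sample row above (the computed column should yield the name):
SELECT jsonString, nameTxt
FROM dbo.jsonStrings;
-- nameTxt: John Doe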
Note that, this works well for the structure you posted. It may need to be tweaked depending on what the JSON does and can look like.
A second (and better but more complex) solution would be to take the JSON path logic from Lukasz Szozda's scalar UDF and get it into a CLR. CLR scalar UDFs, when written correctly, do not have the aforementioned problems that T-SQL scalar UDFs do.

MSSQL Data type conversion

I have a pair of databases (one MSSQL and one Oracle), run by different teams. Some data are now being synchronized regularly by a stored procedure in the MSSQL database. This stored procedure runs a very large MERGE statement along these lines:
MERGE [mssqltable].[Mytable] as s
USING THEORACLETABLE.BLA as t
ON t.[R_ID] = s.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
The field Brokenfield was numeric until today and could take the values NULL, 0, 1, ..., 24.
Now the Oracle team has introduced a breaking change: for some reason they changed the type of the column to string, and it now holds values NULL, "", "ALFA", "BRAVO"... Of course, the sync got broken.
What is the easiest way to fix the sync here? I (MySQL team lead, frontend expert but not so much in databases) would usually hand this to one of our database experts, but all of them are ill right now, and the fix must go online today...
I thought of a stored procedure like CONVERT_BROKENFIELD_INT_TO_STRING or similar, based on some switch/case logic, that could be called in that merge statement, but I'm not sure how to do that.
Edit/Clarification:
What I need is a chunk of SQL code (a stored procedure or function) that takes an input of "ALFA" and returns 1, "BRAVO" -> 2, etc., and that can be reused, to avoid writing huge IFs in more than one place.
If you cannot simplify the logic for correct values the way @RichardHansell described, you can create a crosswalk table from BrokenField to the correct values. Then you can use a common table expression or subquery with a left join to that crosswalk to use in the merge.
create table dbo.BrokenField_Crosswalk (
BrokenField varchar(32) not null primary key
, CorrectedValue int
);
insert into dbo.BrokenField_Crosswalk (BrokenField,CorrectedValue) values
('ALFA', 1)
, ('ALPHA', 1)
, ('BRAVO', 2)
...
go
And your code for the merge would look something like this:
;with cte as (
select o.R_ID
, o.Field1
, BrokenField = cast(isnull(c.CorrectedValue,o.BrokenField) as int)
....
from oracle_table.bla as o
left join dbo.BrokenField_Crosswalk as c
on c.BrokenField = o.BrokenField
)
merge into [mssqltable].[Mytable] t
using cte as s
on t.[R_ID] = s.[R_ID]
when matched
then update set
[Field1] = s.[Field1]
, ...
, [Brokenfield] = s.[BrokenField]
when not matched by target
then
If they are using names with a letter at the start that goes in a sequence:
A = 1
B = 2
C = 3
etc.
Then you could do something like this:
MERGE [mssqltable].[Mytable] as s
USING THEORACLETABLE.BLA as t
ON ASCII(LEFT(t.[R_ID], 1)) - ASCII('A') + 1 = s.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
Edit: but actually I re-read your question and you are talking about [Brokenfield] being the problem column, so my solution wouldn't work.
I don't really understand now, as it seems as though the MERGE statement is updating the oracle table with numbers, so surely you need the mapping to work the other way, i.e. 1 -> ALFA, 2 -> BETA, etc.?

Unnest multiple arrays in parallel

My last question Passing an array to stored to postgres was a bit unclear. Now, to clarify my objective:
I want to create a Postgres stored procedure that accepts two input parameters. One will be a list of amounts, for instance (100, 40.5, 76), and the other will be a list of invoices ('01-2222-05','01-3333-04','01-4444-08'). After that I want to use these two lists and do something with them; for example, take each amount from the list of numbers and assign it to the corresponding invoice.
Something like that in Oracle would look like this:
SOME_PACKAGE.SOME_PROCEDURE (
789,
SYSDATE,
SIMPLEARRAYTYPE ('01-2222-05','01-3333-04','01-4444-08'),
NUMBER_TABLE (100,40.5,76),
'EUR',
1,
P_CODE,
P_MESSAGE);
Of course, the two types SIMPLEARRAYTYPE and NUMBER_TABLE are defined earlier in DB.
You will love this new feature of Postgres 9.4:
unnest(anyarray, anyarray [, ...])
unnest() with the much anticipated (at least by me) capability to unnest multiple arrays in parallel cleanly. The manual:
expand multiple arrays (possibly of different types) to a set of rows. This is only allowed in the FROM clause;
It's a special implementation of the new ROWS FROM feature.
Your function can now just be:
CREATE OR REPLACE FUNCTION multi_unnest(_some_id int
, _amounts numeric[]
, _invoices text[])
RETURNS TABLE (some_id int, amount numeric, invoice text) AS
$func$
SELECT _some_id, u.* FROM unnest(_amounts, _invoices) u;
$func$ LANGUAGE sql;
Call:
SELECT * FROM multi_unnest(123, '{100, 40.5, 76}'::numeric[]
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]);
Of course, the simple form can be replaced with plain SQL (no additional function):
SELECT 123 AS some_id, *
FROM unnest('{100, 40.5, 76}'::numeric[]
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS u(amount, invoice);
In earlier versions (Postgres 9.3-), you can use the less elegant and less safe form:
SELECT 123 AS some_id
, unnest('{100, 40.5, 76}'::numeric[]) AS amount
, unnest('{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS invoice;
Caveats of the old shorthand form: besides it being non-standard to have set-returning functions in the SELECT list, the number of rows returned is the lowest common multiple of the element counts of the arrays (with surprising results for unequal lengths). Details in these related answers:
Parallel unnest() and sort order in PostgreSQL
Is there something like a zip() function in PostgreSQL that combines two arrays?
This behavior has finally been sanitized with Postgres 10. Multiple set-returning functions in the SELECT list produce rows in "lock-step" now. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
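For illustration, on Postgres 10 or later the lock-step behavior with arrays of unequal length looks like this; the shorter result is padded with NULLs instead of multiplying rows (a small sketch):
SELECT unnest('{100, 40.5, 76}'::numeric[]) AS amount
     , unnest('{01-2222-05,01-3333-04}'::text[]) AS invoice;
-- 3 rows; invoice is NULL in the last row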
Arrays are declared by adding [] to the base datatype. You declare them as a parameter the same way you declare regular parameters:
The following function accepts an array of integers and an array of strings and returns some dummy text:
create function array_demo(p_data integer[], p_invoices text[])
returns text
as
$$
select p_data[1] || ' => ' || p_invoices[1];
$$
language sql;
select array_demo(array[1,2,3], array['one', 'two', 'three']);
SQLFiddle demo: http://sqlfiddle.com/#!15/fdb8d/1

SQL Server 2014 - XQuery - get comma-separated List

I have a database table in SQL Server 2014 with only an ID column (int) and a column xmldata of type XML.
This xmldata column contains for example:
<book>
<title>a nice Novel</title>
<author>Maria</author>
<author>Peter</author>
</book>
As expected, I have multiple books, therefore multiple rows with xmldata.
I now want to execute a query for all books where Peter is an author. I tried this in some XPath 2.0 testers and came to the conclusion that:
/book/author/concat(text(), if(position() != last())then ',' else '')
works.
Ported into SQL Server 2014 Express with correctly escaped syntax, it looks like this:
SELECT id
FROM books
WHERE 'Peter' IN (xmldata.query('/book/author/concat(text(), if(position() != last())then '','' else '''')'))
SQL Server however does not seem to support a construction like /concat(...) because of:
The XQuery syntax '/function()' is not supported.
I am at a loss then however, why /text() would work in:
SELECT id, xmldata.query('/book/author/text()')
FROM books
which it does.
My constraints:
I am bound to use SQL Server
I am bound to xpath or something else that can be "injected" as the statement above (if the structure of the xml or the database changes, the xpath above could be changed isolated and the application logic above that constructs the Where clause will not be touched) SEE EDIT
Is there a way to make this work?
regards,
BillDoor
EDIT:
My second constraint boils down to this:
An Application constructs the Where clause by
expression <operator> value(s)
expression is stored in a database and is mapped by the xmlTag, e.g.:
| tokenname | querystring
| "author"  | "xmldata.query(/book/author/text())"
The values are supplied by the requesting user. So if the user asks for the author "Peter" with the operator "EQUALS", the application constructs:
xmldata.query(/book/author/text()) = "Peter"
as the WHERE clause.
If the customer now decides that author needs to be nested in an <authors> element, I can simply change the expression in the construction database and the whole machine keeps running without any changes to the code.
So I need a way to achieve that
<xPath> <operator> "Peter"
or any other combination of these three isolated components (see above: "Peter" IN <xPath>...) gets me all of Peter's books, even if there are multiple unsorted authors.
This would not suffice either (it's not SQL Server syntax, but you get the idea):
WHERE xmldata.exist('/dossier/client[text() = "$1"]', "Peter") = 1;
because the operator is still nested inside the expression, so I could not request <> "Peter".
I know this is strange, please don't question the concept as a whole - it has a history :/
EDIT: further clarification:
The filter-rules come into the app in an XML structure basically:
Operator: "EQ"
field: "name"
value "Peter"
evaluates to:
expression = lookupExpressionForField("name") --> "table2.xmldata.value('book/author/name[1]', 'varchar')"
operator = lookUpOperatorMapping("EQ") --> "="
value = FormatValues("Peter") --> "Peter" (if multiple values are passed, FormatValues constructs a comma-separated list)
the application then builds:
- constructClause(String expression,String operator,String value)
"table2.xmldata.value('book/author/name[1]', 'varchar')" + "=" + "Peter"
then constructs a Select statement with the result as WHERE clause.
It does not build it exactly like this, unescaped and unfiltered for injection etc., but this is the basic idea.
I can influence how the input is translated, meaning I can implement the methods:
lookupExpressionForField(String field)
lookUpOperatorMapping(String operator)
Formatvalues(List<String> values) | Formatvalues(String value)
constructClause(String expression,String operator,String value)
however I choose to; I can change the parameter types and implement them freely. The less, the better, of course. So simply constructing a comma-separated list with XPath would be optimal (as if I could just tick "enable /function() syntax in XPath" somewhere in SQL Server and the /concat(if...) would work).
How about something like this:
SET NOCOUNT ON;
DECLARE @Books TABLE (ID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY, BookInfo XML);
INSERT INTO @Books (BookInfo)
VALUES (N'<book>
<title>a nice Novel</title>
<author>Maria</author>
<author>Peter</author>
</book>');
INSERT INTO @Books (BookInfo)
VALUES (N'<book>
<title>another one</title>
<author>Bob</author>
</book>');
SELECT *
FROM @Books bk
WHERE bk.BookInfo.exist('/book/author[text() = "Peter"]') = 1;
This returns only the first "book" entry. From there you can extract any portion of the XML field using the "value" function.
The "exist" function returns a boolean / BIT. This will scan through all "author" nodes within "book", so there is no need to concat into a comma-separated list only for use in an IN list, which wouldn't work anyway ;-).
For more info on the "value" and "exist" functions, as well as the other functions for use with XML data, please see:
xml Data Type Methods

h2: "data conversion error" on array returned from stored procedure

This is a followup post to this post. I am writing an accounting system backed by an h2 database. The tree of accounts is stored in the ACCOUNTS table, with the PARENT_ID column storing the links in the tree.
To get the path to a given node in the tree, I have the following stored procedure:
public static Long[] getAncestorPKs(Long id)
whose job is to produce an array of integers, being the PARENT_ID values between the given node and the root of the tree. Let's imagine it is defined like this (because I have tried this and I get the same error):
public static Long[] getAncestorPKs(Long id)
{
return new Long[]{new Long(1), new Long(2), new Long(3)};
}
It is properly registered in the database and I can call it from within a SQL query. My problem is that h2 seems to be unable to deal with the return value: if I use it like this:
SELECT ID FROM ACCOUNTS WHERE ID IN (ANCESTOR_PKS(5))
then I get the following error:
Data conversion error converting "(1, 2, 3)"; SQL statement:
SELECT ID FROM ACCOUNTS WHERE ID IN (ANCESTOR_PKS(5)) [22018-167]
If, instead, I send the following to the database:
SELECT ID FROM ACCOUNTS WHERE ID IN (1, 2, 3)
I get back a result set with 3 rows, containing the three integers (exactly what I expect).
I really can't see what the problem is here! I am returning an array of Longs, which are to be compared against a column containing BIGINTs. Why is h2 refusing to convert this array? I have tried making the return value Object[], because the h2 documentation is not entirely clear whether this is required on the return side as well as on the call side, but that makes no difference at all. I'm just banging my head against a brick wall here. This ain't rocket science! Surely someone has written similar code before?
Many thanks in advance, before I go mad!
If the method returns an array of objects, then for the database this is one value of data type ARRAY. And not a table with 3 rows. But of course you don't use the data type ARRAY in your table, you use INT or BIGINT. So your query is incorrect.
Either the method needs to return a ResultSet, or you need to convert the array value to a table. To do that, you could use the function TABLE(..) as follows:
select x from table(x bigint = getAncestorPKs(1));
So what you could do is:
drop table accounts;
create table accounts(id int);
insert into accounts values(1), (2), (10), (20);
drop alias getAncestorPKs;
create alias getAncestorPKs as 'Long[] getAncestorPKs(Long id) {
return new Long[]{new Long(1), new Long(2), new Long(3)};
}';
select * from accounts where id in
(select x from table(x bigint = getAncestorPKs(1)));
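With the sample data above, the inner select expands the returned ARRAY into three rows (1, 2, 3), so the outer query returns the accounts with id 1 and 2.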
