How to query multiple JSON document schemas in Snowflake?

Could anyone tell me how to change the stored procedure in the article below so that it recursively expands all the attributes of a JSON document (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2

Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP well, including how it works.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';
with MAIN_TABLE as
(
    select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct
    REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS path_name, -- Strips bracket-enclosed array references (like [0]) and replaces the remaining non-alphanumeric characters with underscores
    typeof(f.value) AS attribute_type, -- The column datatype Snowflake inferred for the value
    path_name AS alias_name -- The column alias, derived from the path
from
    MAIN_TABLE,
    LATERAL FLATTEN(identifier($VARIANT_COLUMN), RECURSIVE=>true) f
where TYPEOF(f.value) != 'OBJECT'
  and not contains(f.path, '[');
Be sure to set the variables to your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify the SP to do what you need. If it doesn't, but there's a way to modify the query so that it does pick up the columns, that would work too.
If it doesn't pick up the columns at all: building on Craig's idea, I also wrote type inference for non-variant data (such as strings from CSV log files that carry no type information). Try the SQL above first and see what it returns.
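If the query does pick up the paths and types, the SP's job is essentially to turn each row of that result into a typed column expression. Purely as a hand-written illustration of what the generated columns look like (the city/main paths here are assumptions, not actual output from the SP):
select
    v:city.name::string as city_name, -- path "city"."name", typeof VARCHAR
    v:main.temp::float  as main_temp  -- path "main"."temp", typeof DOUBLE
from weather sample (10 rows);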

Related

Getting length of binary column in snowflake using information schema

As the title suggests, I want to determine the length that I have specified while creating the column of type BINARY in Snowflake. I tried to get this information from Information_Schema.COLUMNS view. But on inspecting the result I did not see any columns that had this information. I thought CHARACTER_OCTET_LENGTH of this view might contain this info but it does not.
I am aware that I can also use SHOW COLUMNS IN TABLE <tab_name> but for my requirement I only want to use the information_schema.
Is this information not stored in the information_schema?
I know you don't want this solution, but I will just put it here for "other people" looking for the same thing.
create table test.test.test_len(bin_10 binary(10), bin_200 binary(200));
show columns in table test.test.test_len;
select
    "column_name" as name,
    parse_json("data_type"):length::number as len
from table(result_scan(last_query_id()));
NAME      LEN
BIN_10    10
BIN_200   200

SQL Server Data Masking bug with "FOR JSON PATH" clause

I'm working with a masked database on my QA server using SQL Server Standard (64-bit) 14.0.1000.169. This is my structure:
CREATE TABLE [dbo].[Test](
    [Column1] VARCHAR(64) NULL,
    [Column2] VARCHAR(64) NULL
)
GO
INSERT INTO [dbo].[Test]
VALUES ('ABCDEFG', 'HIJKLMN')
I've masked the column with the following code:
ALTER TABLE [dbo].[Test]
ALTER COLUMN [Column1] VARCHAR(64) MASKED WITH (FUNCTION = 'default()');
It works as expected when I perform the following query using a non-allowed user:
SELECT [Column1], [Column2]
FROM [dbo].[Test]
FOR JSON PATH
-- RESULT: '[{"Column1":"xxxx", "Column2":"HIJKLMN"}]'
But it doesn't work when the same non-allowed user saves the result in variable (the main goal):
DECLARE @var VARCHAR(64)
SET @var = (SELECT [Column1], [Column2] FROM [dbo].[Test] FOR JSON PATH)
SELECT @var -- it should show valid JSON...
-- RESULT: 'xxxx' <-- JSON LOSES ITS STRUCTURE
-- DESIRED RESULT: '[{"Column1":"xxxx", "Column2":"HIJKLMN"}]' <-- VALID JSON
Main problem: the JSON loses its structure when a masked column appears in the SELECT and the FOR JSON PATH clause is present.
We want to get valid JSON regardless of whether the column is masked and regardless of whether the user is sa.
I've tested using NVARCHAR and CASTing the masked column, but the only way we get the desired result is by using a #tempTable before applying the FOR JSON PATH clause.
How can I SELECT a masked column and save the result to a VARCHAR variable without losing the JSON structure?
Any help will be appreciated.
NOTE: The sa user is allowed by default to see unmasked data (so the JSON doesn't lose its structure), but we want to run this as a non-allowed user and still get valid JSON back, not just 'xxxx'.
It does indeed appear to be a bug. A repro is here. Although, see the update below; I'm not so sure anymore.
When using FOR JSON (or, for that matter, FOR XML) as a top-level SELECT construct, a different code path is used compared to placing it in a subquery or assigning it to a variable. This is one of the reasons for the 2033-byte-per-row limit in a bare FOR JSON.
What appears to be happening is that in the case of a bare FOR JSON, the data masking happens at the top of the plan, in a Compute Scalar operator just before the JSON SELECT operator, so the masking applies to just the one column (query plan on PasteThePlan).
Whereas when you put it inside a subquery, a UDX function operator is used. The problem is that the Compute Scalar happens after the UDX has created the JSON or XML, whereas it should have been pushed down below the UDX in the plan (query plan on PasteThePlan).
I suggest you file the bug with Microsoft, on the Azure Feedback site.
Having gone over this a little, I now think the nested behaviour is actually not a bug. What does seem to be a bug is the case without nesting.
From the documentation:
Whenever you project an expression referencing a column for which a data masking function is defined, the expression will also be masked. Regardless of the function (default, email, random, custom string) used to mask the referenced column, the resulting expression will always be masked with the default function.
Therefore, when you select any masked column, even in a normal SELECT, if you use a function on the column then the masking always happens after any other functions. In other words, the masking is not applied when the data is read, it is applied when it is finally output to the client.
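You can see this rule directly with a scalar function over the masked column (a sketch against the Test table above, run as a non-allowed user):
SELECT UPPER([Column1]) AS Upper_Col1, -- 'xxxx', not 'ABCDEFG': the whole expression is masked with default()
       [Column2] AS Plain_Col2         -- 'HIJKLMN': no mask defined on this column
FROM [dbo].[Test]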
When using a subquery, the data is fed into a UDX function operator. The compiler now senses that the final resultset is a normal SELECT, just that it needs to mask any final result that came from the masked column. So the whole JSON is masked as one blob, similar to if you did UPPER(yourMaskedColumn). See the XML plan in this fiddle for an example of that.
But when using a bare FOR JSON, it appears to the compiler as a normal SELECT, just that the final output is changed to JSON (the top-level SELECT operator is different). So the masking happens before that point. This seems to me to be a bug.
The bug is even more egregious when you use FOR XML, which uses the same mechanisms. If you use a nested FOR XML ..., TYPE then you get just <masked /> irrespective of whether you nest it or not. Again this is because the query plan shows the masking happening after the UDX. Whereas if you don't use , TYPE then it depends if you nest it. See fiddle.
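Until that's fixed, the asker's #tempTable workaround can be sketched like this (my wording of it, assuming the Test table above): copying masked data with SELECT INTO stores the already-masked values, so the later FOR JSON runs over plain data and the structure survives.
SELECT [Column1], [Column2]
INTO #masked
FROM [dbo].[Test]; -- the non-allowed user copies the masked value 'xxxx' into the temp table

DECLARE @var NVARCHAR(MAX); -- NVARCHAR(MAX) here just to avoid truncating the JSON
SET @var = (SELECT [Column1], [Column2] FROM #masked FOR JSON PATH);
SELECT @var;
-- RESULT: '[{"Column1":"xxxx","Column2":"HIJKLMN"}]'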

Snowflake:Export data in multiple delimiter format

Requirement:
Need the file to be exported in the format below, where gender, age, and interest are columns and the value after ':' is the data for that column. Can this be achieved using Snowflake? If not, is it possible to export the data using Python?
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: Solution as given by @demircioglu
It displays NULL values instead of the other column values. (The EMPLOYEES table data was shown in a screenshot.)
This is the query I ran:
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES;
Create your SQL with the desired format and write it to a file
COPY INTO @~/stage_data
FROM
(
    SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
The file format here is not important, because each line will be treated as a single field given your delimiter requirements.
Edit:
The CONCAT function (aliased as ||) returns NULL if any input value is NULL.
In order to eliminate NULLs you can use the NVL2 function, so your SQL will have a series of NVL2s.
NVL2 checks its first parameter: if it's not NULL it returns the first expression, and if it is NULL it returns the second expression.
So for the User column,
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
P.S. I am leaving it up to you to create the rest of the SQL (a rough sketch follows below), because Stack Overflow's function is to help find the solution, not spoon-feed the solution.
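As a rough sketch of where that pattern ends up (using the table and column names from the query above; the TRIM call is my addition, to strip a trailing ';' when the later fields are absent):
SELECT NVL2(User, 'User'||User||'^', '')
    || TRIM(
           NVL2(gender,   'gender:'||gender||';', '')
        || NVL2(age,      'age:'||age||';', '')
        || NVL2(interest, 'interest:'||interest, '')
       , ';')
FROM table;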
No, I do not believe multiple different delimiters like this are supported in Snowflake at this time. Multi-byte and multi-character delimiters are supported, but each must be one fixed sequence, specified once for fields and once for records.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this, or even transformative SQL statements. This is not really my area of expertise, so if someone has an example for you, I'll let them add to the discussion.
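For example, a multi-character field delimiter is accepted, but it is one fixed sequence applied between every field (a sketch; the stage path and table name are placeholders):
COPY INTO @~/stage_data
FROM my_table
file_format = (TYPE = CSV FIELD_DELIMITER = '^~^');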

Snowflake - how to display column name using function?

I did not get much help from the Snowflake documentation about how to generate column names using Snowflake functions.
I have an automated report which does calculations for the given dates. Here it is:
select
sum(case when logdate = to_date(dateadd('day', - 10, '2019-11-14')) then eng_fees + data_fees end) AS to_date(dateadd('day', - 10, '2019-11-14'))
from myTable
where logdate = '2019-11-04'
I am getting the following output, with the column name shown above the value:
to_date(dateadd('day', - 10, '2019-11-14'))
100
My expected output, with the date as the column name:
2019-11-04
100
How can I get the expected date as the column name in Snowflake?
Your statement is using an AS to name your column. Snowflake will treat that as a literal, not a calculation. In order to do what you're requesting inside a function, you'll need to use a Javascript Function, I think. This will allow you to dynamically build the SQL Statement with your calculated column name predefined.
https://docs.snowflake.net/manuals/sql-reference/udf-js.html
You can't have dynamic column names without using external functionality (or TASK scheduling).
You can create a JavaScript Stored Procedure that generates a VIEW where the column names can be set by dynamic parameters / expressions.
The normal way of handling this is to use a reporting tool that can display your fixed column result set with dynamic headers or run dynamic SQL altogether.
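A minimal sketch of that stored-procedure approach (all object names here are hypothetical; the procedure builds a view whose single column is aliased with the supplied date):
create or replace procedure create_report_view(run_date varchar)
returns varchar
language javascript
as
$$
    // Arguments are exposed to the JavaScript body in uppercase (RUN_DATE).
    var sql = 'create or replace view report_v as ' +
              'select sum(eng_fees + data_fees) as "' + RUN_DATE + '" ' +
              'from myTable where logdate = \'' + RUN_DATE + '\'';
    snowflake.execute({ sqlText: sql });
    return 'created report_v';
$$;

call create_report_view('2019-11-04');
select * from report_v; -- the column header is 2019-11-04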
I see 2 paths to getting close to what you want:
1) You can simply use UNION to have your headers displayed as the first row and change the actual column aliases into 1...N, so that they just carry the column position (see the sketch after this list) - this is going to be the fastest and the cheapest.
2) You can use dynamic SQL to generate a query that has your alias names dynamically filled in and then simply run it (you can use, for example, PIVOT to construct such a query).
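A sketch of path 1, reusing the query from the question; both branches are cast to text so the UNION types line up, and the date travels as the first data row while the column itself is just named "1":
select '2019-11-04' as "1"
union all
select to_varchar(sum(case when logdate = dateadd('day', -10, to_date('2019-11-14'))
                           then eng_fees + data_fees end))
from myTable
where logdate = '2019-11-04';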

Find columns that match in two tables

I need to query two tables of companies: the first table has the full company names, and the second table has the same names but incomplete. The idea is to find the fields that are similar. (The question included screenshots of the reference data and of the desired result.) The closest way I found to do this:
SELECT DISTINCT
RTRIM(a.NombreEmpresaBD_A) as NombreReal,
b.EmpresaDB_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE 'VoIP%' AND b.EmpresaDB_B LIKE 'VoIP%'
The problem with the above code is that it only returns the records matched by the WHERE clause, and if I change it to LIKE '%' it returns the Cartesian product of the two tables. The RDBMS is Microsoft SQL Server. I would greatly appreciate your help with any proposed solution.
Use the short name plus an appended '%' as the argument in the LIKE expression.
(Edited with the information that we are dealing with SQL Server:)
SELECT a.NombreEmpresaBD_A as NombreReal
,b.NombreEmpresaBD_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE (b.NombreEmpresaBD_B + '%');
According to your screenshot you had the column name wrong!
String concatenation in T-SQL uses the + operator.
The query above finds a case like
'Computex S.A' LIKE 'Computex%'
but not:
'Voip Service Mexico' LIKE 'VoipService%'
For that you would have to strip blanks first or use more powerful pattern-matching functions.
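For example, stripping blanks on both sides before comparing makes the second case match too (a sketch; note that wrapping the columns in REPLACE defeats any index on them):
SELECT a.NombreEmpresaBD_A AS NombreReal,
       b.NombreEmpresaBD_B AS NombreIncompleto
FROM EmpresaDB_A a
JOIN EmpresaDB_B b
  ON REPLACE(a.NombreEmpresaBD_A, ' ', '') LIKE REPLACE(b.NombreEmpresaBD_B, ' ', '') + '%';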
I have created a demo for you on data.SE.
Look up pattern matching or the LIKE operator in the manual.
I would suggest adding a foreign key between the tables, linking the data. Then you can just search the one table and join the second to get the other results.
