How to convert JSON to table format in Snowflake

I have an external table INV_EXT_TBL (one VARIANT column named "VALUE") in Snowflake with 6000 rows (each row is a JSON file). The JSON records contain doubled double quotes because they are in dynamo_json format.
What is the best approach to parse all the JSON files and convert them into table format so I can run SQL queries? I have given a sample of just the top 3 JSON files.
"{
""Item"": {
""sortKey"": {
""S"": ""DR-1630507718""
},
""vin"": {
""S"": ""1FMCU9GD2JUA29""
}
}
}"
"{
""Item"": {
""sortKey"": {
""S"": ""affc5dd0875c-1630618108496""
},
""vin"": {
""S"": ""SALCH625018""
}
}
}"
"{
""Item"": {
""sortKey"": {
""S"": ""affc5dd0875c-1601078453607""
},
""vin"": {
""S"": ""KL4CB018677""
}
}
}"
I created a local table and inserted data into it from the external table by casting the data types. Is this the correct approach, or should I use the PARSE_JSON function against the JSON files to store the data in the local table?
insert into DB.SCHEMA.INV_HIST(VIN,SORTKEY)
(SELECT value:Item.vin.S::string AS VIN, value:Item.sortKey.S::string AS SORTKEY FROM INV_EXT_TBL);

I resolved this by creating a materialized view that casts the fields out of the VARIANT column of the external table. This got rid of the outer double quotes and improved performance multifold. I did not proceed with the table-creation approach.
CREATE OR REPLACE MATERIALIZED VIEW DB.SCHEMA.MVW_INV_HIST
AS
SELECT value:Item.vin.S::string AS VIN, value:Item.sortKey.S::string AS SORTKEY
FROM DB.SCHEMA.INV_EXT_TBL;
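For completeness, a minimal usage sketch (not from the original post) querying the view as if it were an ordinary table; the VIN value is taken from the first sample record:
SELECT VIN, SORTKEY
FROM DB.SCHEMA.MVW_INV_HIST
WHERE VIN = '1FMCU9GD2JUA29';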

Related

Load JSON Data into Snowflake table

My data is as follows:
[ {
"InvestorID": "10014-49",
"InvestorName": "Blackstone",
"LastUpdated": "11/23/2021"
},
{
"InvestorID": "15713-74",
"InvestorName": "Bay Grove Capital",
"LastUpdated": "11/19/2021"
}]
So far I have tried:
-- created table
CREATE OR REPLACE TABLE STG_PB_INVESTOR (
Investor_ID string, Investor_Name string, Last_Updated DATETIME
);
-- created file format
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
-- created stage
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR from @investor_stage;
I am getting an error:
SQL compilation error: JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.
You should be loading your JSON data into a table with a single column that is a VARIANT. Once in Snowflake you can either flatten that data out with a view or a subsequent table load. You could also flatten it on the way in using a SELECT on your COPY statement, but that tends to be a little slower.
Try something like this:
CREATE OR REPLACE TABLE STG_PB_INVESTOR_JSON (
var variant
);
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR_JSON from @investor_stage;
create or replace table STG_PB_INVESTOR as
SELECT
var:InvestorID::string as Investor_id,
var:InvestorName::string as Investor_Name,
TO_DATE(var:LastUpdated::string,'MM/DD/YYYY') as last_updated
FROM STG_PB_INVESTOR_JSON;
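As the answer mentions, the flattening can also happen on the way in with a SELECT inside the COPY statement. A rough sketch, assuming the same stage and file format as above (illustrative only, not the original answer's code):
copy into STG_PB_INVESTOR (Investor_ID, Investor_Name, Last_Updated)
from (
    select
        $1:InvestorID::string,
        $1:InvestorName::string,
        -- to_date is assumed to be allowed in a COPY transformation here;
        -- if not, load the raw string and convert in a later step
        to_date($1:LastUpdated::string, 'MM/DD/YYYY')
    from @investor_stage
);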

Migrating from Hive Map to Snowflake Variant

We are migrating our code from Hive to Snowflake, so a Hive map is migrated to a Snowflake variant. However, when we loaded data into the Snowflake table, we are seeing additional KEY and VALUE strings in our data.
Hive MAP data -
{"SD10":"","SD9":""}
SnowSQL Variant data -
[ { "key": "SD10", "value": "" }, { "key": "SD9", "value": "" }]
I am using a stage and ORC files to load data from Hadoop to Snowflake.
Is there a way we can store the map data as-is in the Snowflake variant? Basically I don't want the additional KEY and VALUE strings.
You can do this with the PARSE_JSON function:
-- Returns { "SD10": "", "SD9": "" }
select parse_json('{"SD10":"","SD9":""}') as JSON;
You can add this to your COPY INTO statement. You can see different options for this in the Snowflake example for Parquet files, which is very similar to loading ORC files in Snowflake: https://docs.snowflake.net/manuals/user-guide/script-data-load-transform-parquet.html.
You can also find more information on transformations on load here: https://docs.snowflake.net/manuals/user-guide/data-load-transform.html
You can do this with a UDF.
create or replace function list_to_dict(v variant)
returns variant
language javascript
as '
// Convert an array of {key, value} entries into a plain object.
function listToDict(input) {
    var returnDictionary = {};
    var inputLength = input.length;
    for (var i = 0; i < inputLength; i++) {
        returnDictionary[input[i]["key"]] = input[i]["value"];
    }
    return returnDictionary;
}
// Snowflake exposes JavaScript UDF arguments in uppercase, hence V.
return listToDict(V);
';
select list_to_dict(column)['SD10'] from table;
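A pure-SQL alternative (not from the original answers) is to flatten the key/value array and rebuild the object with OBJECT_AGG. A sketch, assuming a hypothetical table MY_TABLE with an ID column and the loaded variant column MAP_COL:
select
    t.id,
    -- rebuild { "SD10": "", "SD9": "" } from the [{key, value}, ...] array
    object_agg(f.value:key::string, f.value:value) as map_obj
from MY_TABLE t,
     lateral flatten(input => t.map_col) f
group by t.id;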

Snowflake select query to return data in json format

I am making a select call on a table and it always returns one row. I would like to get the data in JSON format.
{
"column_name1": "value1",
"column_name2": "value2",
}
Does Snowflake allow anything like this?
OBJECT_CONSTRUCT is the way to go for this.
For example,
select object_construct(*) from t1;
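As an added note (not part of the original answer): if a JSON string is needed rather than an OBJECT value, the result can be wrapped in TO_JSON:
select to_json(object_construct(*)) from t1;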

How to extract data from specific fields in a NESTED JSON using AWS Athena - Presto?

I have JSONs in the below format in a S3 bucket and I'm trying to extract only the "id", "label" & "value" from the "fields" key using Athena. I tried ARRAY-MAP but wasn't successful. Also, on the "value" field - I want the content to be captured as a simple text ignoring any list / dictionaries in it.
I also don't want to create any Hive schema for these JSONs and am looking for a Presto SQL solution if possible.
{
"reports":{
"client":{
"pdf":"https://reports.s3-accelerate.amazonaws.com/looks/123/reports/client.pdf",
"html":"https://api.com/looks/123/reports/client.html"
},
"public":{
"pdf":"https://s3.amazonaws.com/reports.com/looks/123/reports/public.pdf",
"html":"https://api.look.com/looks/123/reports/public.html"
}
},
"actors":{
"looker":{
"firstName":"Rosa",
"lastName":"Mart"
},
"client":{
"email":"XXX.XXX#XXXXXX.com",
"firstName":"XXX",
"lastName":"XXX"
}
},
"_id":"123",
"fields":[
{
"id":"fence_condition_missing_sections",
"context":[
"Fence Condition"
],
"label":"Missing Sections",
"type":"choice",
"value":"None"
},
{
"id":"photos_landscaped_area",
"context":[
"Landscaping Photos"
],
"label":"Landscaped Area",
"type":"photo-with-description",
"value":[
{
"description":"Front",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"
},
{
"description":"Front entrance ",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"
}
]
}
],
"jobNumber":"xxx",
"createdAt":"2018-10-11T22:39:37.223Z",
"completedAt":"2018-01-27T20:13:49.937Z",
"inspectedAt":"2018-01-21T23:33:48.718Z",
"type":"ZZZ-commercial",
"name":"Commercial"
}
expected output:
---------------------------------------------------------------------------------------
| ID                               | LABEL            | VALUE                         |
---------------------------------------------------------------------------------------
| photos_landscaped_area           | Landscaped Area  | [{"description":"Front",...}] |
| fence_condition_missing_sections | Missing Sections | None                          |
---------------------------------------------------------------------------------------
I'm going to assume your data is in a one-document-per-line format and that you provided a formatted example for readability's sake. If this is incorrect, please see the question Multi-line JSON file querying in hive.
When the schema of a JSON document is not entirely regular you can create that column as a string column and use the JSON_* functions to extract values out of it.
First you need to create a table for the raw data:
CREATE EXTERNAL TABLE data (
fields array<struct<id:string,label:string,value:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://…'
(if you're not interested in the other fields in the JSON documents you can just ignore those when creating the table)
Then you create a view that flattens the data:
CREATE VIEW flat_data AS
SELECT
field.id,
field.label,
field.value
FROM data
CROSS JOIN UNNEST(fields) AS f(field)
Selecting from this view should give you the results you are looking for.
I suspect you are also looking for how to extract properties from the values structure, which is what I alluded to above:
SELECT
label,
JSON_EXTRACT(value, '$.photo') AS photo_urls
FROM flat_data
WHERE id = 'photos_landscaped_area'
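Not part of the original answer, but since "value" is itself an array for the photo fields, individual photo URLs could be pulled out by parsing and unnesting it. A rough sketch against the flat_data view above:
SELECT
  id,
  json_extract_scalar(photo_entry, '$.photo') AS photo_url
FROM flat_data
CROSS JOIN UNNEST(CAST(json_parse(value) AS ARRAY(JSON))) AS p(photo_entry)
WHERE id = 'photos_landscaped_area'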
Look in the Presto documentation for all available JSON functions.

Convert column into nested field in destination table on load job in Big Query

I am currently running a job to transfer data from one table to another via a query. But I can't seem to find a way to convert a column into a nested field containing the column as a child field. For example, I have a column customer_id: 3 and I would like to convert it to {"customer": {"id":3}}. Below is a snippet of my job data.
query='select * FROM ['+ BQ_DATASET_ID+'.'+Table_name+'] WHERE user="'+user+'"'
job_data={"projectId": PROJECT_ID,
'jobReference': {
'projectId': PROJECT_ID,
'job_id': str(uuid.uuid4())
},
'configuration': {
'query': {
'query': query,
'priority': 'INTERACTIVE',
'allowLargeResults': True,
"destinationTable":{
"projectId": PROJECT_ID,
"datasetId": user,
"tableId": destinationTable,
},
"writeDisposition": "WRITE_APPEND"
},
}
}
Unfortunately, if the "customer" RECORD does not exist in the input schema, it is not currently possible to generate that nested RECORD field with child fields through a query. We have features in the works that will allow schema manipulation like this via SQL, but I don't think it's possible to accomplish this today.
I think your best option today would be an export, transformation to desired format, and re-import of the data to the desired destination table.
A simple solution is to run:
select customer_id as customer.id ....
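As an added note (not from the original answers): in BigQuery Standard SQL a nested RECORD can be produced directly with STRUCT when writing query results to a destination table. A sketch, assuming a hypothetical source table with a customer_id column:
SELECT
  * EXCEPT (customer_id),
  -- produces a "customer" RECORD with an "id" child field
  STRUCT(customer_id AS id) AS customer
FROM `project.dataset.source_table`;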
