Load JSON Data into Snowflake table - snowflake-cloud-data-platform

My data is as follows:
[ {
"InvestorID": "10014-49",
"InvestorName": "Blackstone",
"LastUpdated": "11/23/2021"
},
{
"InvestorID": "15713-74",
"InvestorName": "Bay Grove Capital",
"LastUpdated": "11/19/2021"
}]
So far I have tried:
CREATE OR REPLACE TABLE STG_PB_INVESTOR (
Investor_ID string, Investor_Name string, Last_Updated DATETIME
);
Created table
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
created file format
create or replace stage investor_stage
file_format = investorformat;
created stage
copy into STG_PB_INVESTOR from @investor_stage
I am getting an error:
SQL compilation error: JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.

You should be loading your JSON data into a table with a single column of type VARIANT. Once it's in Snowflake, you can flatten that data out with either a view or a subsequent table load. You could also flatten it on the way in using a SELECT inside your COPY statement (sketched after the example below), but that tends to be a little slower.
Try something like this:
CREATE OR REPLACE TABLE STG_PB_INVESTOR_JSON (
var variant
);
create or replace file format investorformat
type = 'JSON'
strip_outer_array = true;
create or replace stage investor_stage
file_format = investorformat;
copy into STG_PB_INVESTOR_JSON from @investor_stage;
create or replace table STG_PB_INVESTOR as
SELECT
var:InvestorID::string as Investor_id,
var:InvestorName::string as Investor_Name,
TO_DATE(var:LastUpdated::string,'MM/DD/YYYY') as last_updated
FROM STG_PB_INVESTOR_JSON;
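If you'd rather flatten on the way in, a COPY with a transforming SELECT would look roughly like this (a minimal sketch, reusing the stage and file format from above):
copy into STG_PB_INVESTOR (Investor_ID, Investor_Name, Last_Updated)
from (
select
$1:InvestorID::string,
$1:InvestorName::string,
TO_DATE($1:LastUpdated::string, 'MM/DD/YYYY')
from @investor_stage
);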

Related

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake that we can play with later:
create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);
copy into '@my_int_stage/fx3.csv'
from (
select *
from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;
list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = 3.8
as $$
import csv
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
class X:
    def process(self):
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row,)
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
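Since the UDF returns each row as a single variant, you can still select by column name by projecting fields out of that variant. A minimal sketch, assuming the file's header names are upper-case column names like CR_ORDER_NUMBER (swap in whatever headers your file actually has):
select x:CR_ORDER_NUMBER::number as order_number
from table(uncsv_py())
limit 10;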
A limitation of what I showed here is that the Python UDF needs an explicit file name (for now), as it can't take a whole folder. Java UDFs can - it will just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html

How to load data into snowflake table from json.gz file

I would like to insert records from my json.gz file into a Snowflake table.
I created these steps:
CREATE FILE FORMAT test_gz TYPE = JSON
create stage my_test_stage
storage_integration = MY_S3
url = 's3://mybucket/'
file_format = test_gz;
copy into test_table
from @my_test_stage
I have an error: JSON file can produce one and only one column of type variant or object or array.
I also tried changing the file format to gzip, but it's not working.
All you need to do is use
CREATE FILE FORMAT test_gz TYPE = JSON COMPRESSION = GZIP
instead of just TYPE = JSON.
Make sure your destination table looks like this:
CREATE OR REPLACE TABLE test_table (JSON_DATA VARIANT)
Either way, you can execute it all at once:
copy into test_table
from 's3://mybucket/'
storage_integration = MY_S3
file_format = (type = json COMPRESSION=GZIP)
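Once loaded, the whole JSON record sits in the single VARIANT column, and you can project fields out of it. A minimal sketch with placeholder field names:
-- field names below are placeholders; use the keys that actually appear in your JSON
select
json_data:id::string as id,
json_data:name::string as name
from test_table;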

CREATE TABLE in AWS Athena with an Anonymous JSON Array

I am working in AWS Athena. I am trying to CREATE EXTERNAL TABLE that has an anonymous JSON array in it. For example:
[{"key1":"value1a","key2":"value2a",...},{"key1":"value1b","key2":"value2b",...}]
I cannot get the table definition to correctly parse the array of objects into a set of rows.
I have tried this:
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85 (
data struct<`key1`:string,
`key2`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'strip.outer.array'='true'
) LOCATION 's3://mys3/new-json-structure/'
TBLPROPERTIES ('has_encrypted_data'='false');
and
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85 (
data array<struct<`key1`:string,
`key2`:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'strip.outer.array'='false'
) LOCATION 's3://mys3/new-json-structure/'
TBLPROPERTIES ('has_encrypted_data'='false');
and in both cases, when I query, only the first object in the array is found.
SELECT stuff.key1
FROM "testdata"."trynum85 ", UNNEST(data) AS t(stuff)
returns
key1
----
value1a
value1b never appears.
If I modify the data and make the array named, i.e.
{"iamme":[{"key1":"value1a","key2":"value2a",...},{"key1":"value1b","key2":"value2b",...}]}
Then a query for key1 yields the full list of results.
How do I handle the anonymous array?
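If reformatting the data is an option, one workaround is to convert the file to newline-delimited JSON, one object per line, which is the layout Athena's JSON SerDes expect; each line then becomes a row and no array or UNNEST is needed. A minimal sketch, with a placeholder table name and location, assuming the rewritten file keeps the same two keys:
-- data rewritten as one object per line:
-- {"key1":"value1a","key2":"value2a"}
-- {"key1":"value1b","key2":"value2b"}
CREATE EXTERNAL TABLE IF NOT EXISTS testdata.trynum85_ndjson (
key1 string,
key2 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mys3/new-json-structure-ndjson/'
TBLPROPERTIES ('has_encrypted_data'='false');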

COPY INTO query on Snowflake returns TABLE does not exist error

I am trying to load data from azure blob storage.
The data has already been staged.
But, the issue is when I try to run
copy into random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
Below is the error I encounter:
raise error_class(
snowflake.connector.errors.ProgrammingError: 001757 (42601): SQL compilation error:
Table 'random_table_name' does not exist
Basically, it says the table does not exist, which it does not, but my syntax is the same as on the website.
In my case the table name is case-sensitive. Snowflake seems to convert everything to upper case. I changed the database/schema/table names to all upper-case and it started working.
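A minimal sketch of why the quoting matters (the table names here are placeholders): unquoted identifiers are stored upper-case and resolve case-insensitively, while a quoted lower-case identifier must be quoted the same way on every later reference:
create table random_table_name (field1 varchar);
copy into RANDOM_TABLE_NAME from @stage_name_i_created file_format = (type='csv');  -- works, both resolve to RANDOM_TABLE_NAME
create table "my_lowercase_table" (field1 varchar);
copy into my_lowercase_table from @stage_name_i_created file_format = (type='csv');    -- fails: table does not exist
copy into "my_lowercase_table" from @stage_name_i_created file_format = (type='csv');  -- works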
First run the below query to fetch the column headers
select $1 FROM @stage_name_i_created/filename.csv limit 1
Assuming the below is the header line from your csv file:
id;first_name;last_name;email;age;location
Create a file_format csv
create or replace file format semicolon
type = 'CSV'
field_delimiter = ';'
skip_header=1;
Then you should define the datatypes and field names as below:
create or replace table <yourtable> as
select $1::varchar as id
,$2::varchar as first_name
,$3::varchar as last_name
,$4::varchar as email
,$5::int as age
,$6::varchar as location
FROM @stage_name_i_created/yourfile.csv
(file_format => semicolon );
The table must exist prior to running a COPY INTO command. In your post, you say that the table does not exist...so that is your issue.
If your table exists, try forcing the full table path like this:
copy into <database>.<schema>.<random_table_name>
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv'
or by steps like this:
use database <database_name>;
use schema <schema_name>;
copy into database.schema.random_table_name
from @stage_name_i_created
file_format = (type='csv')
pattern ='*.csv';
rbachkaniwala, what do you mean by 'How do I create a table? (according to Snowflake syntax it is not possible to create empty tables)'?
You can just do below to create a table
CREATE TABLE random_table_name (FIELD1 VARCHAR, FIELD2 VARCHAR)
The table does need to exist. You should check the documentation for COPY INTO.
Other areas to consider are
do you have the right context set for the database & schema
does the user / role have access to the table or object.
It basically seems like you don't have the table defined yet. You should
ensure the table is created
ensure all columns in the CSV exist as columns in the table
ensure the order of the columns is the same as in the CSV
I'd check data types too.
"COPY INTO" is not a query command, it is the actual data transfer execution from source to destination, which both must exist as others commented here but If you want just to query without loading the files then run the following SQL:
//Display list of files in the stage to verify stage
LIST @stage_name_i_created;
//Create a file format
CREATE OR REPLACE FILE FORMAT RANDOM_FILE_CSV
type = csv
COMPRESSION = 'GZIP' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 0 FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = 'NONE' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N');
//Now select the data in the files
Select $1 as first_col, $2 as second_col //add as many columns as necessary
from @stage_name_i_created
(FILE_FORMAT => RANDOM_FILE_CSV)
More information can be found in the documentation link here
https://docs.snowflake.com/en/user-guide/querying-stage.html

How to use inline file format to query data from stage in Snowflake data warehouse

Is there any way to query data from a stage with an inline file format without copying the data into a table?
When using a COPY INTO table statement, I can specify an inline file format:
COPY INTO <table>
FROM (
SELECT ...
FROM @my_stage/some_file.csv
)
FILE_FORMAT = (
TYPE = CSV,
...
);
However, the same thing doesn't work when running the same select query directly, outside of the COPY INTO command:
SELECT ...
FROM @my_stage/some_file.csv
(FILE_FORMAT => (
TYPE = CSV,
...
));
Instead, the best I can do is to use a pre-existing file format:
SELECT ...
FROM @my_stage/some_file.csv
(FILE_FORMAT => 'my_file_format');
But this doesn't allow me to programmatically change the file format when creating the query. I've tried every syntax variation possible, but this just doesn't seem to be supported right now.
I don't believe it is possible but, as a workaround, can't you create the file format programmatically, use that named file format in your SQL and then, if necessary, drop it?
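A minimal sketch of that workaround (the format name and options are placeholders):
-- build and run this statement programmatically with whatever options you need
create or replace file format tmp_csv_format
type = 'CSV'
field_delimiter = ';'
skip_header = 1;

select $1, $2
from @my_stage/some_file.csv
(file_format => 'tmp_csv_format');

drop file format if exists tmp_csv_format;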

Resources