How to read the headers of a CSV file in a Snowflake stage

I am learning Snowflake and was trying to read the headers of a CSV file stored in an AWS S3 bucket. I used the metadata fields, which required me to reference columns positionally as $1, $2, and so on, to obtain the headers (for building a COPY INTO / table creation statement).
Is there a better alternative to this?
Statement:
select top 100
    metadata$filename,
    metadata$file_row_number,
    t.$1,
    t.$2,
    t.$3,
    t.$4,
    t.$5,
    t.$6
from @aws_stage t
where metadata$filename = 'OrderDetails.csv';
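As a quick illustration outside Snowflake: the positional references $1, $2, ... simply correspond to the order of the columns in the header row, so reading the first row gives you the name behind each position. A minimal Python sketch of that mapping, using made-up column names for OrderDetails.csv:

```python
import csv
import io

# Hypothetical sample standing in for the staged OrderDetails.csv
sample = "order_id,product,qty\n1,widget,5\n2,gadget,3\n"

reader = csv.reader(io.StringIO(sample))
headers = next(reader)  # the header row, i.e. what $1, $2, ... point at

# Map each positional reference ($1, $2, ...) to its column name
positional = {f"${i + 1}": name for i, name in enumerate(headers)}
print(positional)
```

As a possible alternative inside Snowflake itself, the INFER_SCHEMA table function (with PARSE_HEADER = TRUE in a CSV file format) may let you derive column names directly from staged files in recent versions; check the documentation for availability.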

Related

Snowflake: How to extract the file name from an empty file

I have a requirement where the input file name has to be captured and stored in a Snowflake table. I am using Snowpipe and a stage to query the file, which is in S3. My code/query works fine when the input file has data; however, if the input file is empty, the COPY command does not capture the file name.
How can I get the file name when the input file is empty/zero bytes? Thanks.
Snowpipe syntax:
CREATE OR REPLACE PIPE DEV.schema.load_pipe auto_ingest = true
AS
COPY INTO schema.TMP_table FROM (
    SELECT
        $1::variant AS MESSAGE,
        SPLIT_PART(METADATA$FILENAME, '/', 4) AS FILE_NAME,
        current_timestamp::timestamp_ntz AS LOAD_TS
    FROM @DEV.STAGE.DEV_STG/input/
)
pattern = 'File_DLYERR_.*';

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern => '.*data.*[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
select t.first_name from @mystage1 (file_format => 'myformat', pattern => '.*data.*[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake we can play with later:
create or replace temporary stage my_int_stage
    file_format = (type = csv compression = none);

copy into '@my_int_stage/fx3.csv'
from (
    select *
    from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
    limit 200000
)
header = true
single = true
overwrite = true
max_file_size = 40772160;

list @my_int_stage;  -- 34MB uncompressed CSV, because why not
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports = ('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys

# Snowflake copies imported files into this directory
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class X:
    def process(self):
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row, )
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
A limitation of what I showed here is that the Python UDF needs the explicit name of a file (for now); it doesn't take a whole folder. Java UDFs do, though an equivalent Java UDF takes longer to write.
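For reference, the DictReader pattern used inside the UDF can be tried locally without Snowflake. A minimal sketch with made-up column names (the real fx3.csv comes from the TPC-DS catalog_returns table):

```python
import csv
import io

# Stand-in for the staged fx3.csv; column names are hypothetical
data = "cr_item_sk,cr_order_number,cr_return_amt\n101,5001,19.99\n102,5002,4.50\n"

# DictReader keys each row by the header line, just as the UDF does,
# so downstream code can use names instead of positions.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["cr_order_number"])
```

This is exactly why the UDF approach answers the original question: once the file is parsed by header, column order in the file no longer matters.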
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html

Data Load into Snowflake table - Geometry data

I have a requirement to load a csv file which contains geometry data into a Snowflake table.
I am using data load option which is available in the Snowflake WebGUI.
The sample geometry data is as below.
LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624)
Since commas are present in the geometry data, the data load option is treating them as separate columns and throwing an error.
I tried updating the csv file with the "to_geography" function as below, but still no luck.
TO_GEOGRAPHY(LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624))
So any pointers on this would be appreciated, the full content of the csv file is as below.
ID," GEOGRAPHIC_ROUTE",Name
12421,"LINESTRING (-118.808186210713 38.2287933407744, -118.808182249848 38.2288155788245, -118.807079844554 38.2293234553217, -118.806532314702 38.229961732287, -118.80625724007 38.2306350645631, -118.805071970015 38.231849721603, -118.804097093763 38.2325380450286, -118.803504299857 38.2328501734747, -118.802726055048 38.2332839062976, -118.802126140311 38.2334442483131, -118.801758172942 38.233542312624)",Winston
As I see it, the fields are enclosed in double quotes to prevent misinterpretation of the comma characters in the geographic data (which is good!).
Could you set FIELD_OPTIONALLY_ENCLOSED_BY to '"' (double quote) for your file format and try to re-import the file?
https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#format-type-options-formattypeoptions
I am able to ingest the sample data using the following COPY command:
copy into TEST_TABLE from @my_stage
    FILE_FORMAT = (type = csv, FIELD_OPTIONALLY_ENCLOSED_BY = '"', skip_header = 1);
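If you want to sanity-check the quoting behavior outside Snowflake, Python's csv module applies the same rule: a quoted field keeps its embedded commas as one value. A small sketch using a shortened version of the sample row from the question:

```python
import csv
import io

# Shortened version of the question's sample row; the LINESTRING field
# contains commas but is protected by the surrounding double quotes.
line = ('12421,"LINESTRING (-118.808186210713 38.2287933407744, '
        '-118.808182249848 38.2288155788245)",Winston\n')

row = next(csv.reader(io.StringIO(line)))
print(len(row))  # the quoted LINESTRING stays a single field
```

FIELD_OPTIONALLY_ENCLOSED_BY = '"' tells Snowflake's CSV parser to apply this same convention.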

Need to select bucket name while reading from s3 stage

I am reading from an S3 folder in Snowflake via a stage.
The bucket in S3 has multiple folders (or objects, if we want to call them that).
The folders are organized by date in the bucket:
date=2020-06-01
date=2020-06-02
date=2020-06-03
date=2020-06-04
date=2020-06-05
I am using the below query to read all the folders at once, which is working just fine.
select raw.$1:name name,
       raw.$1:id ID
from @My_Bucket/student_date/
     (FILE_FORMAT => PARQUET,
      PATTERN => '.*date=.*[.]gz[.]parquet') raw;
Now I want to select the folder name as well in my query. Is there a way to do it?
I'd like the output to contain:
name | id | date
Please suggest.
Snowflake has a built-in metadata field that provides the full filename, including the path. You should be able to run the following query:
select raw.$1:name name,
       raw.$1:id ID,
       METADATA$FILENAME
from @My_Bucket/student_date/
     (FILE_FORMAT => PARQUET,
      PATTERN => '.*date=.*[.]gz[.]parquet') raw;
I know you are after the date portion only, but once you have the filename, you can use the SPLIT_PART function to get the date part from the filename. e.g.
SPLIT_PART(METADATA$FILENAME, '/', 4)
Hope this helps.
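A note for readers unfamiliar with SPLIT_PART: it is 1-indexed, and the index you need depends on the depth of your stage path, so check what METADATA$FILENAME actually returns before hardcoding it. A rough Python equivalent, with a hypothetical path, to experiment with:

```python
def split_part(s: str, sep: str, n: int) -> str:
    """Rough local equivalent of Snowflake's 1-indexed SPLIT_PART."""
    parts = s.split(sep)
    return parts[n - 1] if 0 < n <= len(parts) else ""

# Hypothetical value of METADATA$FILENAME for one of the staged files
fname = "My_Bucket/student_date/date=2020-06-01/part-000.gz.parquet"
print(split_part(fname, "/", 3))  # the date=... folder component
```

From there, stripping the "date=" prefix (or a further SPLIT_PART on '=') yields the bare date value.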

SSIS: Getting parent row field in sub-row output in XML Source

I have an SSIS package which reads an XML file using the XML Source component.
This XML file has two outputs: one for "Invoice" and the other for "InvoiceDetail".
The structure of the XML file is like this:
<my:myFields>
  <my:group1>
    <my:Invoice>
      <my:field1>1</my:field1>
      <my:field2>2014-11-11</my:field2>
      <my:field3>33370</my:field3>
      <my:Group2>
        <my:InvoiceDetail>
          <my:Sub6 xsi:nil="true">100</my:Sub6>
          <my:Sub7 xsi:nil="true">Charges</my:Sub7>
          <my:Sub8>140</my:Sub8>
          <my:Sub9 xsi:nil="true">78</my:Sub9>
          <my:Sub10 xsi:nil="true">0</my:Sub10>
          <my:Sub12>0</my:Sub12>
        </my:InvoiceDetail>
      </my:Group2>
      <my:field18></my:field18>
    </my:Invoice>
  </my:group1>
</my:myFields>
I can get all fields of Invoice and InvoiceDetail in separate outputs.
But I cannot join these rows, since InvoiceDetail doesn't have the ID (field1) that links it to the Invoice.
Is there any way to get the InvoiceID field with the InvoiceDetail output as well?
It can be done with an XSLT transformation: create an XSLT schema, then pass the XML and flat-file parameters to a C# script that performs the XML transformation.
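Outside SSIS, the same parent-key join can be sketched in a few lines of Python with ElementTree. This uses a simplified, namespace-free version of the question's XML purely as an illustration of the idea (copy the parent's field1 onto each child row):

```python
import xml.etree.ElementTree as ET

# Simplified (namespace-free) version of the invoice XML from the question
xml = """
<myFields>
  <Invoice>
    <field1>1</field1>
    <InvoiceDetail><Sub8>140</Sub8></InvoiceDetail>
    <InvoiceDetail><Sub8>77</Sub8></InvoiceDetail>
  </Invoice>
</myFields>
"""

root = ET.fromstring(xml)
rows = []
for inv in root.iter("Invoice"):
    invoice_id = inv.findtext("field1")  # the parent key (field1)
    for det in inv.iter("InvoiceDetail"):
        # each detail row carries its parent's InvoiceID
        rows.append({"InvoiceID": invoice_id, "Sub8": det.findtext("Sub8")})
print(rows)
```

An XSLT stylesheet would do the same thing declaratively: select InvoiceDetail nodes and pull field1 from the ancestor Invoice into each output row.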
