COPY INTO with partitioned ADLS - snowflake-cloud-data-platform

I have a container with partitioned Parquet files that I want to use with the COPY INTO command. My directory structure looks like this:
ABC_PARTITIONED_ID=1 (directory)
    1-snappy.parquet
    2-snappy.parquet
    3-snappy.parquet
    4-snappy.parquet
ABC_PARTITIONED_ID=2 (directory)
    1-snappy.parquet
    2-snappy.parquet
    3-snappy.parquet
ABC_PARTITIONED_ID=3 (directory)
    1-snappy.parquet
    2-snappy.parquet
....
Each partitioned directory can contain multiple parquet files. I do not have a hive partition column that matches the pattern of the directories (ID1, ID2 etc).
How do I properly use the PATTERN parameter in the COPY INTO command to load a Snowflake table from my ADLS container? I am using this https://www.snowflake.com/blog/how-to-load-terabytes-into-snowflake-speeds-feeds-and-techniques/ as an example.

I do not think you need to do anything with the pattern parameter.
You said you do not have a Hive partition column that matches the pattern of the directories. If you do not have a column that uses these partitions, then they are probably not beneficial for querying the data; maybe they were generated to help with maintenance. If that is the case, ignore the partitioning and read all the files with a single COPY command.
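A minimal sketch of that plain COPY, reusing the hypothetical stage and table names from the example further below:
copy into TARGET_TABLE
from @your_stage/data_folder/
file_format = (type = parquet)
match_by_column_name = case_insensitive;  -- assumes TARGET_TABLE's column names match the parquet column names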
If you think having such a column would help, then the blog post you mentioned already shows how you can parse the filenames to generate the column value. Add the partition column to your table (you may even define it as the clustering key), and run the COPY command to read all files in all partitions/directories, parsing the value of the column from the file name.
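For example, a minimal sketch of such a target table, with the same hypothetical column names used in the COPY below (the partition column sits alongside the regular columns and doubles as the clustering key):
create or replace table TARGET_TABLE (
    partitioned_column_value varchar,
    column_name varchar
    -- ... the remaining columns from the parquet files
)
cluster by (partitioned_column_value);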
For parsing the partition value, I would use this approach, which seems simpler:
copy into TARGET_TABLE from (
  select
    REGEXP_SUBSTR(
      METADATA$FILENAME,
      '.*ABC_PARTITIONED_ID=(.*)/.*',
      1, 1, 'e', 1
    ) partitioned_column_value,
    $1:column_name,
    ...
  from @your_stage/data_folder/
);

If the directory/partition name doesn't matter to you, then you can use some of the newer functions in Public Preview that support the Parquet format to create the table and ingest the data. To answer your question on how to construct the pattern: PATTERN is a regular expression, so PATTERN = '.*[.]parquet' will match the Parquet files in all subfolders.
//create the file format; only required once
create file format my_parquet_format
  type = parquet;

//EXAMPLE CREATE AND COPY INTO FOR TABLE ABC
//create an empty table using this file format and location
create or replace table ABC
  using template (
    select array_agg(object_construct(*))
    from table(
      infer_schema(
        location => '@mystage/ABC_PARTITIONED_ROOT',
        file_format => 'my_parquet_format'
      )
    ));

//copy the parquet files under ABC_PARTITIONED_ROOT into table ABC
copy into ABC
  from @mystage/ABC_PARTITIONED_ROOT
  pattern = '.*[.]parquet'
  file_format = my_parquet_format
  match_by_column_name = case_insensitive;

This should be possible by creating a storage integration, granting access in Azure for Snowflake to access the storage location, and then creating an external stage.
Alternatively you can generate a shared access signature (SAS) token to grant Snowflake (limited) access to objects in your storage account. You can then access an external (Azure) stage that references the container using the SAS token.
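A minimal sketch of both options, assuming hypothetical account, container, stage, and integration names (the tenant ID and SAS token are placeholders):
-- Option 1: storage integration plus external stage
create storage integration azure_int
  type = external_stage
  storage_provider = 'AZURE'
  enabled = true
  azure_tenant_id = '<your-tenant-id>'
  storage_allowed_locations = ('azure://myaccount.blob.core.windows.net/mycontainer/');

create or replace stage my_adls_stage
  url = 'azure://myaccount.blob.core.windows.net/mycontainer/'
  storage_integration = azure_int;

-- Option 2: external stage using a SAS token
create or replace stage my_adls_stage_sas
  url = 'azure://myaccount.blob.core.windows.net/mycontainer/'
  credentials = (azure_sas_token = '?sv=...');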

Snowflake provides the following metadata columns:
METADATA$FILENAME - name of the staged data file the current row belongs to, including the path to the data file in the stage.
METADATA$FILE_ROW_NUMBER - row number for each record in the staged data file.
We could do something like this:
select $1:normal_column_1, ..., METADATA$FILENAME
FROM '@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
For example, it would return something like:
METADATA$FILENAME
----------
path/to/data/year=2021/part-00020-6379b638-3f7e-461e-a77b-cfbcad6fc858.c000.snappy.parquet
We need to derive the partition column from it. We can do a REGEXP_REPLACE and extract the partition value as a column like this:
select
  regexp_replace(METADATA$FILENAME, '.*\/year=(.*)\/.*', '\\1') as year,
  $1:normal_column_1
FROM '@stage_name/path/to/data/' (pattern => '.*.parquet')
limit 5;
In the above regexp, we give the partition key (year). The third parameter, '\\1', is the regex group number; in our case the first group match holds the partition value.
A more detailed answer, along with other approaches to this issue, is available in this Stack Overflow answer.

Related

Locating Columns that Contain a String in their Name

Other than manually traversing every table schema in the entire database, how can I produce a list of all tables that contain a field whose name contains the string "email" in Pervasive 13?
For example, in IBM DB2, I can do this with a query like this:
select tabschema,tabname,colname
from syscat.columns
where upper(colname) LIKE UPPER('%email%')
order by tabname
How can I achieve this in Pervasive 13?
You can query the system tables:
SELECT f.Xf$Name, g.Xe$Name
FROM X$File f
INNER JOIN X$Field g ON g.Xe$File = f.Xf$Id
WHERE UPPER(g.Xe$Name) LIKE '%EMAIL%';
I'm still open to other suggestions, but the way I did this was by exporting the database schema to a .sql text file and using the regular expression create table.*email to search through that file and locate all the tables containing a column with "email" in its name.
This worked, but I look forward to other people's suggestions.

query snowflake s3 external file

I've created an S3 external stage and uploaded CSV files into a folder in the stage.
I can see the stage content by running list @my_stage.
If I query the stage with
select $1,$2,$3,$4,$5,$6 from @my_s3_stage
it looks like I'm randomly picking up files.
So I'm trying to select from a specific file by adding a pattern:
PATTERN => job.csv
This returns no results.
Note: I've used snowflake for all of 5 hours so pretty new to syntax
For a pattern you can use
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
The pattern is a regex expression.
For a specific file, you have to add the file name to the query like this:
select t.$1, t.$2 from @mystage/data1.csv.gz t;
If the file format is set in your stage definition, you don't need the file_format parameter.
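For example, a minimal sketch, assuming hypothetical stage, format, and bucket names:
create or replace file format my_csv_format
  type = csv
  field_delimiter = ','
  skip_header = 1;

create or replace stage my_s3_stage
  url = 's3://my-bucket/stage/'
  -- (credentials or a storage integration would be added here)
  file_format = my_csv_format;

-- file_format can now be omitted when querying; the pattern is still a regex
select t.$1, t.$2 from @my_s3_stage (pattern => '.*job[.]csv') t;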
More info can be found here: https://docs.snowflake.com/en/user-guide/querying-stage.html

How to query multiple JSON document schemas in Snowflake?

Could anyone tell me how to change the Stored Procedure in the article below to recursively expand all the attributes of a json file (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2
Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP and how it works well.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';

with MAIN_TABLE as
(
  select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct
  REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS path_name, -- generates a column name from the path by stripping bracket-enclosed array references (like [0]) and replacing other special characters with underscores
  typeof(f.value) AS attribute_type,  -- generates the column datatype
  path_name AS alias_name             -- generates a column alias based on the path
from
  MAIN_TABLE,
  LATERAL FLATTEN(identifier($VARIANT_COLUMN), RECURSIVE=>true) f
where TYPEOF(f.value) != 'OBJECT'
  AND NOT contains(f.path, '[');
Be sure to replace the variables with your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify this SP to do what you need. If it doesn't, but there's a way to modify the query to get it to pick up the columns, that would work too.
If it doesn't pick up the columns: based on Craig's idea, I ended up writing type inference for non-variant columns (such as strings from CSV log files without type information). Try the SQL above first and see what it returns.
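As a rough illustration of that idea (not Craig's code), here is a sketch that infers a type for a single untyped string column, assuming a hypothetical table RAW_LOGS with a VARCHAR column C1; you would repeat it per column:
select
  case
    when count_if(try_to_number(C1) is null and C1 is not null) = 0 then 'NUMBER'
    when count_if(try_to_timestamp(C1) is null and C1 is not null) = 0 then 'TIMESTAMP_NTZ'
    else 'VARCHAR'  -- fall back to VARCHAR when neither cast succeeds for every non-null value
  end as inferred_type
from RAW_LOGS sample (1000 rows);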

Copy data files from internal stage table to Logical tables

I am dealing with JSON and CSV files moving from a Unix box / S3 bucket to an internal/external stage respectively,
and I don't have any issue copying the JSON files from the internal/external stages to a static or logical table, where I store them as JsonFileName and JsonFileContent.
Trying to copy to the static table (parse_json($1) is working for JSON):
COPY INTO LogicalTable (FILE_NM, JSON_CONTENT)
from (
  select METADATA$FILENAME AS FILE_NM, parse_json($1) AS JSON_CONTENT
  from @$TSJsonExtStgName
)
file_format = (type='JSON' strip_outer_array = true);
I am looking for something similar for CSV: copy the CSV file name and file content from the internal/external stage to static or logical tables. I mainly want this to separate file copy from file loading, since a load may fail due to a column count mismatch, newline characters, or bad data in one of the records.
If either one of the options below can be clarified, that is fine; please suggest:
1) Trying to copy to Static table (METADATA$?????? not working for CSV)
select METADATA$FILENAME AS FILE_NM, METADATA$?????? AS CSV_CONTENT
from @INT_REF_CSV_UNIX_STG
2) Trying for dynamic columns (T.* not working for CSV)
SELECT METADATA$FILENAME, $1, $2, $3, T.*
FROM @INT_REF_CSV_UNIX_STG (FILE_FORMAT => CSV_STG_FILE_FORMAT) T
Regardless of whether the file is CSV or JSON, you need to make sure that your SELECT matches the table layout of the target table. I assume with your JSON, your target table is 2 columns...filename and a VARIANT column for your JSON contents. For CSV, you need to do the same thing. So, you need to do the $1, $2, etc. for each column that you want from the file...that matches your target table.
I have no idea what you are referencing with METADATA$??????, btw.
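To illustrate the point above about listing $1, $2, etc., here is a minimal sketch for a CSV target with a filename column plus two data columns (the table and column names are hypothetical):
COPY INTO LogicalTableCsv (FILE_NM, COL1, COL2)
from (
  select METADATA$FILENAME, $1, $2
  from @INT_REF_CSV_UNIX_STG
)
file_format = (type = 'CSV' skip_header = 1);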
---ADDED
Based on your comment below, you have 2 options, which aren't native to a COPY INTO statement:
1) Create a Stored Procedure that looks at a table DDL and generates a COPY INTO statement that has the static columns defined and then executing the COPY INTO from within the SP.
2) Leverage an External Table. By defining an External Table with the METADATA$FILENAME and the rest of the columns, the External Table will return the CSV contents to you as JSON. From there, you can treat it in the same way you are treating your JSON tables (see the sketch below).
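A minimal sketch of option 2, assuming the CSV files land in an external stage (external tables cannot be built on internal stages); the stage and table names are placeholders:
create or replace external table REF_CSV_EXT
  with location = @EXT_REF_CSV_STG
  file_format = (type = csv)
  auto_refresh = false;

-- VALUE is the implicit VARIANT column holding each row as JSON (keys c1, c2, ...),
-- and METADATA$FILENAME is available just like in a COPY transformation
select metadata$filename as file_nm, value as csv_content
from REF_CSV_EXT;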

How to Dynamically render Table name and File name in pentaho DI

I have a requirement in which one source is a table and one source is a file. I need to join both of these on a column. The problem is that I can do this for one table with one transformation, but I need to do it for multiple sets of files and tables, loading into a corresponding set of target files, using the same transformation.
Breaking down my requirement more specifically :
Source Table Source File Target File
VOICE_INCR_REVENUE_PROFILE_0 VoiceRevenue0 ProfileVoice0
VOICE_INCR_REVENUE_PROFILE_1 VoiceRevenue1 ProfileVoice1
VOICE_INCR_REVENUE_PROFILE_2 VoiceRevenue2 ProfileVoice2
VOICE_INCR_REVENUE_PROFILE_3 VoiceRevenue3 ProfileVoice3
VOICE_INCR_REVENUE_PROFILE_4 VoiceRevenue4 ProfileVoice4
VOICE_INCR_REVENUE_PROFILE_5 VoiceRevenue5 ProfileVoice5
VOICE_INCR_REVENUE_PROFILE_6 VoiceRevenue6 ProfileVoice6
VOICE_INCR_REVENUE_PROFILE_7 VoiceRevenue7 ProfileVoice7
VOICE_INCR_REVENUE_PROFILE_8 VoiceRevenue8 ProfileVoice8
VOICE_INCR_REVENUE_PROFILE_9 VoiceRevenue9 ProfileVoice9
The table and file names always correspond, i.e. VOICE_INCR_REVENUE_PROFILE_0 should always join with VoiceRevenue0 and the result should be stored in ProfileVoice0. There should be no mismatches in this case. I tried setting variables with the table names and file names, but that only takes one value at a time.
All table names and file names are constant. Is there any other way to get around this? Any help would be appreciated.
Try using "Copy rows to result" step. It will store all the incoming rows (in your case the table and file names) into a memory. And for every row, it will try to execute your transformation. In this way, you can read multiple filenames at one go.
Try reading this link. Its not the exact answer, but similar.
I have created a sample here. Please check if this is what is required.
In the first transformation, I read the table names and file names and loaded them into memory. After that, I used the Get Variables step to read all the file and table names to generate the output. [Note: I have not used a table input as a source anywhere; instead I used TablesNames. You can replace that with the table input data.]
Hope it helps :)
