Create 'granular' date partitions (year, month, day) in S3 parquet folders from a single date column in AWS Wrangler - amazon-sagemaker

I am using AWS Data Wrangler to write a dataframe to parquet files in an S3 bucket, and am trying to get a 'Hive'-like folder structure of:
prefix
- year=2022
-- month=08
--- day=01
--- day=02
--- day=03
In the following code example:
import awswrangler as wr
import pandas as pd
wr.s3.to_parquet(
    df=pd.DataFrame({
        'date': ['2022-08-01', '2022-08-02', '2022-08-03'],
        'col2': ['A', 'A', 'B']
    }),
    path='s3://bucket/prefix',
    dataset=True,
    partition_cols=['date'],
    database='default'
)
The resulting s3 folder structure would be:
prefix
- date=2022-08-01
- date=2022-08-02
- date=2022-08-03
The SageMaker Feature Store ingest function (https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html) does something like this automatically: the event_time_feature_name column (a timestamp) drives the creation of the Hive-style folder structure in S3.
How can I do this with AWS Data Wrangler without deriving three additional columns from the single date column and declaring them as partitions, i.e. pass in the one date column and have the year/month/day partitions created automatically?
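For reference, the workaround this question is trying to avoid would look roughly like the sketch below: derive year/month/day columns from the single date column and pass those as partition_cols (the column names and zero-padding are just illustrative, chosen to match the month=08 / day=01 layout above):
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-08-01', '2022-08-02', '2022-08-03'],
    'col2': ['A', 'A', 'B']
})

# Derive the partition columns from the single date column.
dt = pd.to_datetime(df['date'])
df['year'] = dt.dt.year.astype(str)
df['month'] = dt.dt.month.astype(str).str.zfill(2)
df['day'] = dt.dt.day.astype(str).str.zfill(2)

wr.s3.to_parquet(
    df=df,
    path='s3://bucket/prefix',
    dataset=True,
    # awswrangler encodes these into the key prefix: prefix/year=2022/month=08/day=01/...
    partition_cols=['year', 'month', 'day']
)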

Related

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract with 1000+ columns in a pipe-delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same; however, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake we can play with later:
create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);
copy into '@my_int_stage/fx3.csv'
from (
select *
from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;
list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class X:
    def process(self):
        # Read the staged CSV with DictReader so each row is keyed by header name.
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row,)
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
A limitation of what I showed here is that the Python UDF needs an explicit name of a file (for now), as it doesn't take a whole folder. Java UDFs do - it will just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html

How to read headers of a CSV file in a Snowflake stage

I am learning Snowflake, and was trying to read the headers of a CSV file stored in an AWS bucket. I used the metadata fields, which required me to input $1, $2 and so on as column names in order to obtain the headers (for COPY INTO table creation).
Is there a better alternative to this?
Statement:
select
Top 100 metadata$filename,
metadata$file_row_number,
t.$1,
t.$2,
t.$3,
t.$4,
t.$5,
t.$6
from
@aws_stage t
where
metadata$filename = 'OrderDetails.csv'

Data not loaded correctly into snowflake table from S3 bucket

I am having trouble loading data from an Amazon S3 bucket into a Snowflake table. This is my command:
copy into myTableName
from 's3://dev-allocation-storage/data_feed/'
credentials=(aws_key_id='***********' aws_secret_key='**********')
PATTERN='.*/.*/.*/.*'
file_format = (type = csv field_delimiter = '|' skip_header = 1 error_on_column_count_mismatch=false );
I have 3 CSV files in my bucket and they are all being loaded into the table. But although I have 8 columns in my target table, all the data is being loaded into the first column as if it were a single JSON object.
Check that you do not have each row enclosed in double-quotes, something like "f1|f2|...|f8". That will be treated as one single column value, unlike "f1"|"f2"|...|"f8".
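The same behaviour is easy to reproduce outside Snowflake with Python's csv module, which may help when checking the files (a quick sketch to illustrate why a fully-quoted row collapses into one value; it is not Snowflake itself):
import csv
import io

# An entire row wrapped in quotes parses as a single field...
whole_row = '"f1|f2|f3|f4|f5|f6|f7|f8"\n'
print(next(csv.reader(io.StringIO(whole_row), delimiter='|')))
# ['f1|f2|f3|f4|f5|f6|f7|f8']

# ...while individually quoted fields parse as eight columns.
per_field = '"f1"|"f2"|"f3"|"f4"|"f5"|"f6"|"f7"|"f8"\n'
print(next(csv.reader(io.StringIO(per_field), delimiter='|')))
# ['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']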

Presto: How to read from s3 an entire bucket that is partitioned in sub-folders?

I need to read, using Presto, an entire dataset from S3 that sits in "bucket-a". But inside the bucket, the data was saved in sub-folders by year. So I have a bucket that looks like this:
Bucket-a>2017>data
Bucket-a>2018>more data
Bucket-a>2019>more data
All the above data belongs to the same table, but it was saved this way in S3. Notice that there is no data in bucket-a itself, only inside each folder.
What I have to do is read all the data from the bucket as a single table, adding the year as a column or partition.
I tried it this way, but it didn't work:
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
partitioned_by = ARRAY['year'],
external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
and also
CREATE TABLE hive.default.mytable (
col1 int,
col2 varchar,
year int
)
WITH (
format = 'json',
bucketed_by = ARRAY['year'],
bucket_count = 3,
external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
None of the above worked.
I have seen people write with partitions to S3 using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders, as a single table.
Thanks.
If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
Since your folders do not follow the convention, you need to register each one individually as a partition using the system.register_partition procedure (available in Presto 330, about to be released). (The alternative to register_partition is to run an appropriate ADD PARTITION in the Hive CLI.)

Import JSON into ClickHouse

I created a table with this statement:
CREATE TABLE event(
date Date,
src UInt8,
channel UInt8,
deviceTypeId UInt8,
projectId UInt64,
shows UInt32,
clicks UInt32,
spent Float64
) ENGINE = MergeTree(date, (date, src, channel, projectId), 8192);
Raw data looks like:
{ "date":"2016-03-07T10:00:00+0300","src":2,"channel":18,"deviceTypeId ":101, "projectId":2363610,"shows":1232,"clicks":7,"spent":34.72,"location":"Unknown", ...}
...
Files with data are loaded with the following command:
cat *.data|sed 's/T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]+0300//'| clickhouse-client --query="INSERT INTO event FORMAT JSONEachRow"
clickhouse-client throws an exception:
Code: 117. DB::Exception: Unknown field found while parsing JSONEachRow format: location: (at row 1)
Is it possible to skip fields from JSON object that not presented in table description?
The latest ClickHouse release (v1.1.54023) supports the input_format_skip_unknown_fields user option, which enables skipping of unknown fields for the JSONEachRow and TSKV formats.
Try
clickhouse-client -n --query="SET input_format_skip_unknown_fields=1; INSERT INTO event FORMAT JSONEachRow;"
See more details in documentation.
Currently, it is not possible to skip unknown fields.
You may create a temporary table with the additional field, INSERT the data into it, and then do an INSERT SELECT into the final table. The temporary table can use the Log engine, and INSERTs into that "staging" table will be faster than into the final MergeTree table.
It would be relatively easy to add the ability to skip unknown fields to the code (something like a 'format_skip_unknown_fields' setting).
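If you are on a release without input_format_skip_unknown_fields, another option, in the spirit of the sed pipeline from the question, is to drop the unknown keys client-side before piping into clickhouse-client. A minimal sketch (the filter_fields.py name is made up here, and the field list mirrors the CREATE TABLE above):
#!/usr/bin/env python3
# filter_fields.py: keep only the keys that exist in the target table.
import json
import sys

KNOWN_FIELDS = {'date', 'src', 'channel', 'deviceTypeId',
                'projectId', 'shows', 'clicks', 'spent'}

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    row = json.loads(line)
    filtered = {k: v for k, v in row.items() if k in KNOWN_FIELDS}
    # Trim the timestamp down to a plain date, like the sed command above.
    filtered['date'] = filtered['date'][:10]
    print(json.dumps(filtered))
It would replace the sed step, e.g. cat *.data | python3 filter_fields.py | clickhouse-client --query="INSERT INTO event FORMAT JSONEachRow".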
