Presto: How to read an entire bucket from S3 that is partitioned in sub-folders?

I need to read an entire dataset from S3 using Presto; it sits in "bucket-a". But inside the bucket, the data was saved in sub-folders by year, so I have a bucket that looks like this:
Bucket-a>2017>data
Bucket-a>2018>more data
Bucket-a>2019>more data
All of the above data belongs to the same table but was saved this way in S3. Notice that bucket-a itself contains no data; it is all inside the year folders.
What I need to do is read all the data from the bucket as a single table, adding year as a column or partition.
I tried it this way, but it didn't work:
CREATE TABLE hive.default.mytable (
    col1 int,
    col2 varchar,
    year int
)
WITH (
    format = 'json',
    partitioned_by = ARRAY['year'],
    external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
and also
CREATE TABLE hive.default.mytable (
    col1 int,
    col2 varchar,
    year int
)
WITH (
    format = 'json',
    bucketed_by = ARRAY['year'],
    bucket_count = 3,
    external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)
Neither of the above worked.
I have seen people write to S3 with partitions using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders, as a single table.
Thanks.

If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
Since your folders do not follow the convention, you need to register each one individually as a partition using the system.register_partition procedure (available in Presto 330, about to be released). (The alternative to register_partition is to run an appropriate ADD PARTITION in the Hive CLI.)
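For example, a sketch of what that could look like, assuming Presto 330+ and the argument order documented for the Hive connector's register_partition procedure:

CREATE TABLE hive.default.mytable (
    col1 int,
    col2 varchar,
    year int
)
WITH (
    format = 'json',
    partitioned_by = ARRAY['year'],
    external_location = 's3://bucket-a/'
);

-- Register each year folder as a partition (repeat for 2018 and 2019);
-- arguments: schema_name, table_name, partition_columns, partition_values, location
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');

-- Hive CLI alternative:
-- ALTER TABLE mytable ADD PARTITION (year=2017) LOCATION 's3://bucket-a/2017/';

Once the partitions are registered, SELECT * FROM hive.default.mytable should read all three folders, with year available as a partition column.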

Related

Snowflake: Copy from S3 into Table with nested JSON

Requirement: load a nested JSON file from S3 into Snowflake.
Error: SQL compilation error: COPY statement only supports simple SELECT from stage statements for import.
I know I can create a temporary table from the SQL. Is there a better way to load directly from S3 into Snowflake?
COPY INTO schema.table_A FROM (
    WITH s3 AS (
        SELECT $1 AS json_array
        FROM @public.stage
            (file_format => 'public.json',
             pattern => 'abc/xyz/.*')
    )
    SELECT DISTINCT
        CURRENT_TIMESTAMP() AS exec_t,
        json_array AS data,
        json_array:id AS id,
        json_array:code::text AS code
    FROM s3, TABLE(FLATTEN(s3.json_array)) f
);
Basically, transformations during loading come with certain limitations; see here: https://docs.snowflake.com/en/user-guide/data-load-transform.html#transforming-data-during-a-load
If you still want to keep your code and not apply the transformations later, you may create a view on top of the stage, and then you basically INSERT into another table based on SELECT * from that view.
Maybe avoiding the CTE already helps.
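A minimal sketch of that idea, assuming the stage, file format, and column names from the question (the view name is made up):

CREATE OR REPLACE VIEW schema.stage_v AS
SELECT CURRENT_TIMESTAMP() AS exec_t,
       $1 AS data,
       $1:id AS id,
       $1:code::text AS code
FROM @public.stage (file_format => 'public.json', pattern => 'abc/xyz/.*');

INSERT INTO schema.table_A
SELECT DISTINCT * FROM schema.stage_v;

Since INSERT ... SELECT is ordinary DML rather than COPY, it is not bound by COPY's transformation restrictions (note that you also lose COPY's load history, so deduplicating across runs is on you).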

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake we can play with later:
create or replace temporary stage my_int_stage
    file_format = (type=csv compression=none);

copy into '@my_int_stage/fx3.csv'
from (
    select *
    from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
    limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;

list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class X:
    def process(self):
        # Read the imported file with DictReader so each row is keyed by its header name
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row,)
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
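Since each row comes back as a single variant keyed by the header names, you can then address columns by name rather than position (using the question's hypothetical first_name header; the demo file above would of course have the catalog_returns headers instead):

select x:first_name::string as first_name
from table(uncsv_py());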
A limitation of what I showed here is that the Python UDF needs the explicit name of a file (for now); it doesn't take a whole folder. Java UDFs do; it would just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html

Error parsing JSON exception for XML field in copy command in Snowflake

Hi, I have declared a table like this:
create or replace table app_event (
    ID varchar(36) not null primary key,
    VERSION number,
    ACT_TYPE varchar(255),
    EVE_TYPE varchar(255),
    CLI_ID varchar(36),
    DETAILS variant,
    OBJ_TYPE varchar(255),
    DATE_TIME timestamp,
    AAPP_EVENT_TO_UTC_DT timestamp,
    GRO_ID varchar(36),
    OBJECT_NAME varchar(255),
    OBJ_ID varchar(255),
    USER_NAME varchar(255),
    USER_ID varchar(255),
    EVENT_ID varchar(255),
    FINDINGS varchar(255),
    SUMMARY variant
);
The DETAILS column will contain an XML document so that I can run XML functions and get elements of that XML.
My sample row looks like this:
dfjkghdfkjghdf8gd7f7997,0,TEST_CASE,CHECK,74356476476DFD,<?xml version="1.0" encoding="UTF-8"?><testPayload><testId>3495864795uiyiu</testId><testCode>COMPLETED</testCode><testState>ONGOING</testState><noOfNewTest>1</noOfNewTest><noOfReviewRequiredTest>0</noOfReviewRequiredTest><noOfExcludedTest>0</noOfExcludedTest><noOfAutoResolvedTest>1</noOfAutoResolvedTest><testerTypes>WATCHLIST</testerTypes></testPayload>,CASE,41:31.3,NULL,948794853948dgjd,(null),dfjkghdfkjghdf8gd7f7997,test user,dfjkghdfkjghdf8gd7f7997,NULL,(null),(null)
When I declare DETAILS as varchar I am able to load the file, but when I declare it as variant I get the below error for that column only:
Error parsing JSON:
dfjkghdfkjghdf8gd7f7997COMPLETED</status
File 'SNOWFLAKE/Sudarshan.csv', line 1, character 89 Row 1, column
"AUDIT_EVENT"["DETAILS":6]
Can you please help with this?
I cannot use varchar, as I also need to query elements of the XML in my queries.
This is how I load into the table; I use the default CSV format, and the file is available in S3:
COPY INTO demo_db.public.app_event
FROM @my_s3_stage/
FILES = ('app_Even.csv')
file_format = (type = 'CSV');
Based on the answer below, this is how I am loading:
copy into demo_db.public.app_event from (
    select
        $1, $2, $3, $4, $5,
        parse_xml($6), $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, parse_xml($17)
    from @~/Audit_Even.csv d
)
file_format = (
    type = CSV
)
But when I execute it, it says zero rows processed, and my stage is not mentioned here.
If you are using a COPY INTO statement, then you need to put in a subquery to convert the data before loading it into the table. Use parse_xml within your COPY statement's subquery, something like this:
copy into app_event from (
    select
        $1,
        parse_xml($2) -- <---- "$2" is the column number in the CSV that contains the xml
    from @~/test.csv.gz d -- <---- This is my own internal user stage. You'll need to change this to your external stage or whatever
)
file_format = (
    type = CSV
)
It is hard to provide you with a good SQL statement without a full example of your existing code (your copy/insert statement). In my example above, I'm copying a file in my own user stage (@~/test.csv.gz) with the default CSV file format options. You are likely using an external stage, but it should be easy to adapt this to your own example.
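For what it's worth, the zero-rows result in the edit is consistent with pointing at @~ (the user stage) while the file actually lives on the S3 stage. A sketch adapting the subquery to the stage and 17-column layout from the question:

copy into demo_db.public.app_event from (
    select
        $1, $2, $3, $4, $5,
        parse_xml($6),  -- DETAILS: the XML payload
        $7, $8, $9, $10, $11, $12, $13, $14, $15, $16,
        parse_xml($17)  -- SUMMARY
    from @my_s3_stage/app_Even.csv
)
file_format = (type = CSV);

Once DETAILS is a variant, elements can be pulled out with Snowflake's XML functions, e.g. something like XMLGET(DETAILS, 'testCode'):"$"::string.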

How to convert a CharField to a DateTimeField in peewee on the fly?

I have a model I created on the fly for peewee. Something like this:
class TestTable(PeeweeBaseModel):
    whencreated_dt = DateTimeField(null=True)
    whenchanged = CharField(max_length=50, null=True)
I load data from a text file into a table using peewee. The column "whenchanged" contains dates in the format '%Y-%m-%d %H:%M:%S', stored as a varchar column. Now I want to convert the text field "whenchanged" into a datetime in "whencreated_dt".
I tried several things... I ended up with this:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
which fails with "TypeError: strptime() argument 1 must be str, not CharField". I'm trying to convert "whencreated" to a datetime and then assign it to "whencreated_dt".
I tried a variation... the following, for example, works without a hitch:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.now()}).execute()" % (self.name)
exec(to_execute)
But this is of course just the current datetime, and not another field.
Anyone knows a solution to this?
Edit... I did find a workaround eventually... but I'm still looking for a better solution... The workaround:
all_objects = table.select()
for o in all_objects:
    datetime_str = getattr(o, 'whencreated')
    setattr(o, 'whencreated_dt', datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S'))
    o.save()
Loop over all rows in the table, get the "whencreated". Convert "whencreated" to a datetime, put it in "whencreated_dt", and save each row.
Regards,
Sven
Your example:
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
Will not work. Why? Because datetime.strptime is a Python function and operates in Python. An UPDATE query works in database-land. How the hell is the database going to magically pass row values into "datetime.strptime"? How would the db even know how to call such a function?
Instead you need to use a SQL function -- a function that is executed by the database. For example, on Postgres:
TestTable.update(whencreated_dt=TestTable.whenchanged.cast('timestamp')).execute()
This is the equivalent SQL:
UPDATE test_table SET whencreated_dt = CAST(whenchanged AS timestamp);
That should populate the column for you using the correct data type. For other databases, consult their manuals. Note that SQLite does not have a dedicated date/time data type, and its datetime functionality uses strings in the Y-m-d H:M:S format.
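So on SQLite specifically, a minimal sketch (assuming the model above): since the storage format is already the Y-m-d H:M:S string, you can copy the column across as-is:

# SQLite stores datetimes as 'Y-m-d H:M:S' strings, so no cast is needed:
TestTable.update(whencreated_dt=TestTable.whenchanged).execute()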

Fitbit Data Export - Creating a data warehouse

I plan to create a Fitbit data warehouse for educational purposes, and there doesn't seem to be any material online for Fitbit data specifically.
A few issues faced:
You can only export one month of data (max) at a time from the Fitbit website. My plan would be to drop a month's worth of data at a time into a folder and have these files read separately.
You can export the data either as CSV or .XLS. The issue with XLS is that each day in the month creates a separate sheet for food logs, which would then need to be merged in a staging table. The issue with CSV is that there is one sheet per file, with all of the data in there (see the CSV layout below).
I would then use SSIS to load the data into a SQL Server database for reporting purposes.
Which would be the more suitable approach: exporting the data in .XLS format or CSV?
Edit: How would it be possible to load a CSV file into SSIS with such a format?
The CSV layout would be as such:
Body,,,,,,,,,
Date,Weight,BMI,Fat,,,,,,
01/06/2018,71.5,23.29,15,,,,,,
02/06/2018,71.5,23.29,15,,,,,,
03/06/2018,71.5,23.29,15,,,,,,
04/06/2018,71.5,23.29,15,,,,,,
05/06/2018,71.5,23.29,15,,,,,,
06/06/2018,71.5,23.29,15,,,,,,
07/06/2018,71.5,23.29,15,,,,,,
08/06/2018,71.5,23.29,15,,,,,,
09/06/2018,71.5,23.29,15,,,,,,
10/06/2018,71.5,23.29,15,,,,,,
11/06/2018,71.5,23.29,15,,,,,,
12/06/2018,71.5,23.29,15,,,,,,
13/06/2018,71.5,23.29,15,,,,,,
14/06/2018,71.5,23.29,15,,,,,,
15/06/2018,71.5,23.29,15,,,,,,
16/06/2018,71.5,23.29,15,,,,,,
17/06/2018,71.5,23.29,15,,,,,,
18/06/2018,71.5,23.29,15,,,,,,
19/06/2018,71.5,23.29,15,,,,,,
20/06/2018,71.5,23.29,15,,,,,,
21/06/2018,71.5,23.29,15,,,,,,
22/06/2018,71.5,23.29,15,,,,,,
23/06/2018,71.5,23.29,15,,,,,,
24/06/2018,71.5,23.29,15,,,,,,
25/06/2018,71.5,23.29,15,,,,,,
26/06/2018,71.5,23.29,15,,,,,,
27/06/2018,71.5,23.29,15,,,,,,
28/06/2018,71.5,23.29,15,,,,,,
29/06/2018,72.8,23.72,15,,,,,,
30/06/2018,72.95,23.77,15,,,,,,
,,,,,,,,,
Foods,,,,,,,,,
Date,Calories In,,,,,,,,
01/06/2018,0,,,,,,,,
02/06/2018,0,,,,,,,,
03/06/2018,0,,,,,,,,
04/06/2018,0,,,,,,,,
05/06/2018,0,,,,,,,,
06/06/2018,0,,,,,,,,
07/06/2018,0,,,,,,,,
08/06/2018,0,,,,,,,,
09/06/2018,0,,,,,,,,
10/06/2018,0,,,,,,,,
11/06/2018,0,,,,,,,,
12/06/2018,0,,,,,,,,
13/06/2018,100,,,,,,,,
14/06/2018,0,,,,,,,,
15/06/2018,0,,,,,,,,
16/06/2018,0,,,,,,,,
17/06/2018,0,,,,,,,,
18/06/2018,0,,,,,,,,
19/06/2018,0,,,,,,,,
20/06/2018,0,,,,,,,,
21/06/2018,0,,,,,,,,
22/06/2018,0,,,,,,,,
23/06/2018,0,,,,,,,,
24/06/2018,0,,,,,,,,
25/06/2018,0,,,,,,,,
26/06/2018,0,,,,,,,,
27/06/2018,"1,644",,,,,,,,
28/06/2018,"2,390",,,,,,,,
29/06/2018,981,,,,,,,,
30/06/2018,0,,,,,,,,
For example, "Foods" would be the table name, "Date" and "Calories In" would be column names. "01/06/2018" is the Date, "0" is the "Calories in" and so on.
Tricky. I just pulled my Fitbit data, as this piqued my curiosity. That CSV is messy; you basically have mixed file formats in one file, and that won't be straightforward in SSIS. As for the XLS format, like you mentioned, the food logs tag each day onto a separate worksheet, and SSIS won't like that changing.
A couple of options for CSV, off the top of my head.
Individual exports from Fitbit
I see you can pick which data you want to include in your export: Body, Foods, Activities, Sleep.
Do each export individually, saving each file with a prefix of what type of data it is.
Then build SSIS with multiple foreach loops and a data flow task for each individual file format.
That would do it, but it would be a tedious effort every time you have to export the data from Fitbit.
Handle the one file with all the data
With this option you would have to get creative, since the formats are mixed and you have sections with different column definitions, etc.
One option would be to create a staging table with as many columns as whichever section has the most, which looks to be "Activities". Give each column a generic name like Column1, Column2, and make them all VARCHAR.
Since we have mixed "formats" and not all data types would line up, we just need to get all the data out first and sort out conversion later.
From there you can build one data flow with a flat file source, and also have a line number added, since we will need it later to sort out where each section of data is.
When building out the file connection for your source, you will have to add all the columns manually: the first row of data in your file doesn't include all the commas for each field, so SSIS won't be able to detect all the columns. Add the number of columns needed, and also make sure:
Text Qualifier = "
Header row Delimiter = {LF}
Row Delimiter = {LF}
Column Delimiter = ,
That should at least get your data loaded into a stage table in the database. From there you would need a bunch of T-SQL to zero in on each "section" of data and then parse, convert, and load from there.
For a small test, I just had a table called TestTable:
CREATE TABLE [dbo].[TestTable](
    [LineNumber] [INT] NULL,
    [Column1] [VARCHAR](MAX) NULL,
    [Column2] [VARCHAR](MAX) NULL,
    [Column3] [VARCHAR](MAX) NULL,
    [Column4] [VARCHAR](MAX) NULL,
    [Column5] [VARCHAR](MAX) NULL,
    [Column6] [VARCHAR](MAX) NULL,
    [Column7] [VARCHAR](MAX) NULL,
    [Column8] [VARCHAR](MAX) NULL,
    [Column9] [VARCHAR](MAX) NULL
)
I built the data flow, hooked up the file source, and executed it, which loaded the raw rows into the table.
From there I worked out some T-SQL to get at each "section" of data. Here's an example that shows how you could filter to the "Foods" section:
DECLARE @MaxLine INT = (
    SELECT MAX([LineNumber])
    FROM [TestTable]
);

--Something like this, using a sub-query that gets you starting and ending line numbers for each section,
--and doing the conversion on whichever column that section of data ended up in.
SELECT CONVERT(DATE, [a].[Column1]) AS [Date]
     , CONVERT(BIGINT, [a].[Column2]) AS [CaloriesIn]
FROM [TestTable] [a]
INNER JOIN (
    --Something like this to build out the starting and ending line numbers for each section
    SELECT [Column1]
         , [LineNumber] + 2 AS [StartLineNumber] --We add 2 here as the line that starts the data in a section is 2 after its "heading"
         , LEAD([LineNumber], 1, @MaxLine) OVER ( ORDER BY [LineNumber] ) - 1 AS [EndLineNumber]
    FROM [TestTable]
    WHERE [Column1] IN ( 'Body', 'Foods', 'Activities' ) --Each of the sections of data
) AS [Section]
    ON [a].[LineNumber] BETWEEN [Section].[StartLineNumber] AND [Section].[EndLineNumber]
WHERE [Section].[Column1] = 'Foods'; --Then just filter on the section you want.
Which in turn gave me the "Foods" section back as typed Date and CaloriesIn rows.
There could be other options for parsing that data, but this should give a good starting point and an idea of how tricky this particular CSV file is.
As for the XLS option, that would be straightforward for all sections except the food logs. You would basically set up an Excel file connection, treat each sheet as a "table" in the source, and have an individual data flow for each worksheet.
But then what about the food logs? Once those changed as you rolled into the next month, SSIS would freak out, error, and probably complain about metadata.
One obvious workaround would be to manually manipulate the Excel file and merge all of those sheets into one "Food Log" sheet prior to running it through SSIS. Not ideal, because you'd probably want something completely automated.
I'd have to tinker around with that. Maybe a script task and some C# code could combine all those sheets into one, parsing the date out of each sheet name and appending it to the data prior to a data flow loading it. Maybe possible.
Looks like there are challenges with the files Fitbit exports, no matter which format you pick.
