I'm doing a COPY INTO a table but I'm seeing the error below. I don't understand it, because the value looks like a timestamp data type. Is it the 'T' that is affecting this?
The data is cleaned using a pandas DataFrame and written to a CSV file that I'm trying to ingest into Snowflake.
I'm wondering if I should change the END_TIME column formatting in pandas, e.g. df['end_time'] = pd.to_datetime(df['end_time']), or is there another way to ingest it into Snowflake as is?
Timestamp '2021-09-17T07:00:00+0000' is not recognized
Row 1, column "FACEBOOK_INSIGHTS"["END_TIME":8]
COPY INTO fbk.fbk_insights
FROM (
SELECT CURRENT_TIMESTAMP::TIMESTAMP_LTZ, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, $2, $3, $4, $5, $6, $7
FROM @luigi.etc/FBK/insights/2021-09-19/
)
file_format = (format_name = 'fbk.fbk_insights')
TRUNCATECOLUMNS = FALSE
FORCE = FALSE
Snowflake supports a variety of input formats; take a look at these to see if one suits:
https://docs.snowflake.com/en/user-guide/date-time-input-output.html#time-formats
If you are not able to find a transformation / format that suits, then I'd do as Maja suggests: import and then convert.
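If you would rather not reformat in pandas, the ISO-8601 'T' and the '+0000' offset can usually be handled on the Snowflake side instead. A rough sketch, not tested against your data; the format string and the assumption that END_TIME arrives as $5 (the 8th target column after the three metadata columns) are mine, so adjust to your file:
-- Option 1: teach the file format how to parse the timestamp strings
alter file format fbk.fbk_insights
    set timestamp_format = 'YYYY-MM-DD"T"HH24:MI:SSTZHTZM';
-- Option 2: convert explicitly in the COPY transformation
copy into fbk.fbk_insights
from (
    select current_timestamp::timestamp_ltz, metadata$filename, metadata$file_row_number,
           $1, $2, $3, $4,
           to_timestamp_tz($5, 'YYYY-MM-DD"T"HH24:MI:SSTZHTZM'),  -- assumed END_TIME position
           $6, $7
    from @luigi.etc/FBK/insights/2021-09-19/
)
file_format = (format_name = 'fbk.fbk_insights');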
Related
My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract with 1000+ columns in a pipe-delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same; however, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
Select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.*data.*[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake we can play with later:
create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);
copy into '@my_int_stage/fx3.csv'
from (
select *
from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;
list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]
class X:
    def process(self):
        # DictReader uses the header row, so each row comes back keyed by column name
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row,)
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
A limitation of what I showed here is that the Python UDF needs an explicit name of a file (for now), as it doesn't take a whole folder. Java UDFs do - it will just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html
End of record reached while expected to parse column '"VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]'
File 'veg_plant_height.csv', line 8, character 14
Row 3, column "VEGETABLE_DETAILS_PLANT_HEIGHT"["HIGH_END_OF_RANGE":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
this is my table
create or replace table VEGETABLE_DETAILS_PLANT_HEIGHT (
PLANT_NAME text(7),
VEG_HEIGHT_CODE text(1),
UNIT_OF_MEASURE text(2),
LOW_END_OF_RANGE number(2),
HIGH_END_OF_RANGE number(2)
);
and the COPY INTO command I used
copy into vegetable_details_plant_height
from @like_a_window_into_an_s3_bucket
files = ( 'veg_plant_height.csv')
file_format = ( format_name=VEG_CHALLENGE_CC );
and the csv file https://uni-lab-files.s3.us-west-2.amazonaws.com/veg_plant_height.csv
The error "End of record reached while expected to parse column" means Snowflake detected there were less than expected columns when processing the current row.
Please review your CSV file and make sure each row has correct number of columns. The error said on line 8.
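One quick way to check is to query the staged file directly and look at the row the error points to. A sketch, assuming the stage and file format names from the question:
-- inspect the raw columns of the staged file; the error points at line 8
select metadata$file_row_number, $1, $2, $3, $4, $5
from @like_a_window_into_an_s3_bucket/veg_plant_height.csv
(file_format => 'VEG_CHALLENGE_CC');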
The table has five columns, but the source file contains values for only four of them, which is why the COPY command returns the error. To resolve the issue, you can modify the COPY command as shown below:
copy into vegetable_details_plant_height(PLANT_NAME, UNIT_OF_MEASURE, LOW_END_OF_RANGE, HIGH_END_OF_RANGE)
from (select $1, $2, $3, $4 from @like_a_window_into_an_s3_bucket)
files = ( 'veg_plant_height.csv') file_format = ( format_name=VEG_CHALLENGE_CC );
As you can see in the CSV file, the data in one column is enclosed in double quotes and the names inside it are separated by commas, so you need to use the FIELD_OPTIONALLY_ENCLOSED_BY = '"' option in your file format.
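A minimal sketch of such a file format, assuming the VEG_CHALLENGE_CC name from the question (whether you need SKIP_HEADER depends on whether your file has a header row):
create or replace file format VEG_CHALLENGE_CC
    type = csv
    field_delimiter = ','
    skip_header = 1
    field_optionally_enclosed_by = '"';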
I am unable to load data from a stage on Snowflake using Java.
I don't see any errors, but the data is not loaded from stage "mystage" into table "TESTTABLE".
Code:
Connection connection = DriverManager.getConnection(connectionUrl, _connectionProperties);
Statement statement = connection.createStatement();
statement.executeQuery("copy into TESTTABLE (id, name) from (select $1, $2 from #mystage/F.csv.gz t);");
If I run the same command in the Snowflake console, the data is loaded into table "TESTTABLE" properly.
We do not know which database/schema the default connection points to. I would try using a fully qualified table name:
statement.executeQuery("copy into <db_name>.<schema_name>.TESTTABLE (id, name) from (select $1, $2 from #mystage/F.csv.gz t);");
By mistake I had commented out connection.commit(); that was the problem.
I want to be able to add a timestamp to the filename I'm writing to S3. So far I've been able to write files to AWS S3 using the example below. Can someone guide me on how to go about putting a datetime stamp in the file name?
copy into @s3bucket/something.csv.gz
from (select * from mytable)
file_format = (type=csv FIELD_OPTIONALLY_ENCLOSED_BY = '"' compression='gzip' )
single=true
header=TRUE;
Thanks in advance.
The syntax for the stage path (or location) portion of the COPY INTO statement does not allow functions to define it dynamically in SQL.
However, you can use a stored procedure to accomplish building dynamic queries, using JavaScript Date APIs and some string formatting.
Here's a very trivial example for your use-case, with some code adapted from another question:
CREATE OR REPLACE PROCEDURE COPY_INTO_PROCEDURE_EXAMPLE()
RETURNS VARIANT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
var rows = [];
var n = new Date();
// May need refinement to zero-pad some values or achieve a specific format
var datetime = `${n.getFullYear()}-${n.getMonth() + 1}-${n.getDate()}-${n.getHours()}-${n.getMinutes()}-${n.getSeconds()}`;
var st = snowflake.createStatement({
sqlText: `COPY INTO '@s3bucket/${datetime}_something.csv.gz' FROM (SELECT * FROM mytable) FILE_FORMAT=(TYPE=CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' COMPRESSION='gzip') SINGLE=TRUE HEADER=TRUE;`
});
var result = st.execute();
result.next();
rows.push(result.getColumnValue(1))
return rows;
$$;
To execute, run:
CALL COPY_INTO_PROCEDURE_EXAMPLE();
The above is missing polished date-format handling (zero-padding months, days, hours, minutes, seconds), error handling (if the COPY INTO fails), parameterisation of the input query, etc., but it should give a general idea of how to achieve this.
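If you would rather not hand-roll the zero-padding in JavaScript, one alternative (just a sketch, not part of the original answer) is to let Snowflake format the timestamp and have the procedure fetch that string before building the COPY statement:
-- returns a zero-padded string such as '2021-09-19-07-05-03'
select to_varchar(current_timestamp, 'YYYY-MM-DD-HH24-MI-SS');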
As Sharvan Kumar suggests above, Snowflake now supports this:
-- Partition the unloaded data by date and hour. Set ``32000000`` (32 MB) as the upper size limit of each file to be generated in parallel per thread.
copy into @%t1
from t1
partition by ('date=' || to_varchar(dt, 'YYYY-MM-DD') || '/hour=' || to_varchar(date_part(hour, ts))) -- Concatenate labels and column values to output meaningful filenames
file_format = (type=parquet)
max_file_size = 32000000
header=true;
list @%t1
This feature is not yet supported in Snowflake; however, it will be coming soon.
I have a model I created on the fly for peewee. Something like this:
class TestTable(PeeweeBaseModel):
    whencreated_dt = DateTimeField(null=True)
    whenchanged = CharField(max_length=50, null=True)
I load data from a text file into a table using peewee; the column "whenchanged" contains all dates in the format '%Y-%m-%d %H:%M:%S' as a varchar column. Now I want to convert the text field "whenchanged" into a datetime and store it in "whencreated_dt".
I tried several things... I ended up with this:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
which fails with a "TypeError: strptime() argument 1 must be str, not CharField": I'm trying to convert "whencreated" to datetime and then assign it to "whencreated_dt".
I tried a variation... the following, for example, works without a hitch:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.now()}).execute()" % (self.name)
exec(to_execute)
But this is of course just the current datetime, and not another field.
Anyone knows a solution to this?
Edit... I did find a workaround eventually... but I'm still looking for a better solution... The workaround:
all_objects = table.select()
for o in all_objects:
    datetime_str = getattr(o, 'whencreated')
    setattr(o, 'whencreated_dt', datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S'))
    o.save()
Loop over all rows in the table, get the "whencreated". Convert "whencreated" to a datetime, put it in "whencreated_dt", and save each row.
Regards,
Sven
Your example:
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
Will not work. Why? Because datetime.strptime is a Python function and operates in Python. An UPDATE query works in database-land. How the hell is the database going to magically pass row values into "datetime.strptime"? How would the db even know how to call such a function?
Instead you need to use a SQL function -- a function that is executed by the database. For example, Postgres:
TestTable.update(whencreated_dt=TestTable.whenchanged.cast('timestamp')).execute()
This is the equivalent SQL:
UPDATE test_table SET whencreated_dt = CAST(whenchanged AS timestamp);
That should populate the column for you using the correct data type. For other databases, consult their manuals. Note that SQLite does not have a dedicated date/time data type, and the datetime functionality uses strings in the Y-m-d H:M:S format.
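If the backend happens to be SQLite (an assumption, since the question does not say), the equivalent statement simply normalises the string with the built-in datetime() function; the stored value is still text, because that is how SQLite represents date/times:
UPDATE test_table SET whencreated_dt = datetime(whenchanged);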