Problems importing timestamp from Parquet files - snowflake-cloud-data-platform

I'm exporting data into Parquet files and importing it into Snowflake. The export is done with Python (using to_parquet from pandas) on a Windows Server machine.
The exported file has several timestamp columns. Here's the metadata of one of these columns (ParquetViewer):
I'm having weird issues trying to import the timestamp columns into Snowflake.
Attempt 1 (using COPY INTO):
create or replace table STAGING.DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0),
"ExitDate" TIMESTAMP_NTZ(9)
);
copy into STAGING.DIM_EMPLOYEE
from @S3
pattern='dim_Employee_.*.parquet'
file_format = (type = parquet)
match_by_column_name = case_insensitive;
select * from STAGING.DIM_EMPLOYEE;
The timestamp column is not imported correctly:
It seems that Snowflake assumes the value in the column is in seconds rather than microseconds and therefore converts it incorrectly.
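The seconds-versus-microseconds interpretation can be reproduced in isolation with TO_TIMESTAMP_NTZ; the epoch value below is just an illustrative microsecond timestamp, not one taken from my data:
-- 1609459200000000 microseconds after the epoch = 2021-01-01 00:00:00
SELECT TO_TIMESTAMP_NTZ(1609459200000000, 6); -- scale 6: the input is read as microseconds
SELECT TO_TIMESTAMP_NTZ(1609459200);          -- no scale: the input is read as whole seconds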
Attempt 2 (using external tables):
Next, I created an external table:
create or replace external table STAGING.EXT_DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (CAST(GET($1, 'ExitDate') AS TIMESTAMP_NTZ(9)))
)
location=@S3
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE;
The data is still incorrect - still the same issue (seconds instead of microseconds):
Attempt 3 (using external tables, with a modified TO_TIMESTAMP):
I then modified the external table definition to state explicitly that the values are in microseconds, using TO_TIMESTAMP_NTZ with a scale parameter of 6:
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V2(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (TO_TIMESTAMP_NTZ(TO_NUMBER(GET($1, 'ExitDate')), 6))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE_V2;
Now the data is correct:
But now the "weird" issue appears:
I can load the data into a table, but the load is quite slow and I get a Querying (repair) message during the load. However, at the end, the query is executed, albeit slow:
I want to load the data from stored procedure, using SQL script. When executing the statement using the EXECUTE IMMEDIATE an error is returned:
DECLARE
SQL STRING;
BEGIN
SQL := 'INSERT INTO STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate") SELECT "EmployeeID", "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V2;';
EXECUTE IMMEDIATE :SQL;
END;
I have also tried defining the timestamp column in the external table as a NUMBER, importing it and converting it to a timestamp afterwards. This produces the same issue (a "SQL execution internal error" when run from a SQL script).
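For reference, that variant looked roughly like this (EXT_DIM_EMPLOYEE_V3 is just an illustrative name):
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V3(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" NUMBER(38,0) AS (CAST(GET($1, 'ExitDate') AS NUMBER(38,0)))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
-- convert the raw microsecond value afterwards
SELECT "EmployeeID", TO_TIMESTAMP_NTZ("ExitDate", 6) AS "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V3;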
Has anyone experienced an issue like this? It seems like a bug to me.
Basically, my goal is to generate INSERT/SELECT statements dynamically and execute them in stored procedures. I have a lot of files (with different schemas) that need to be imported, and I want to create a "universal logic" to load these Parquet files into Snowflake.
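For illustration, a rough sketch of the kind of generic procedure I have in mind (the procedure name, argument names and the CALL below are just placeholders):
CREATE OR REPLACE PROCEDURE STAGING.LOAD_FROM_EXTERNAL(SOURCE_TABLE STRING, TARGET_TABLE STRING, COLUMN_LIST STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
SQL STRING;
BEGIN
-- build the INSERT/SELECT dynamically from the passed-in names
SQL := 'INSERT INTO ' || TARGET_TABLE || ' (' || COLUMN_LIST || ') SELECT ' || COLUMN_LIST || ' FROM ' || SOURCE_TABLE;
EXECUTE IMMEDIATE :SQL;
RETURN 'Loaded ' || TARGET_TABLE;
END;
$$;
CALL STAGING.LOAD_FROM_EXTERNAL('STAGING.EXT_DIM_EMPLOYEE_V2', 'STAGING.DIM_EMPLOYEE', '"EmployeeID", "ExitDate"');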

As confirmed in the Snowflake Support ticket you opened, this issue was resolved when the Snowflake Support team enabled an internal configuration for Parquet timestamp logical types.
If anyone encounters a similar issue, please submit a Snowflake Support ticket.

Related

Migrating from SQL Server to Hive Table using flat file

I am migrating my data from SQL Server to Hive using the following steps, but there is a data issue with the resulting table. I tried various options, including checking the data types and using csvSerde, but I am not able to get the data aligned properly in the respective columns. I followed these steps:
Export the SQL Server data to a flat file with fields separated by commas.
Create an external table in Hive as given below and load the data.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable (
r_date timestamp
, v_nbr varchar(12)
, d_account int
, d_amount decimal(19,4)
, a_account varchar(14)
)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable/mytable.txt' OVERWRITE INTO TABLE myschema.mytable;
There is an issue with the data in every combination I could try. I also tried OpenCSVSerde, but the result was worse than with a simple text file. I also tried changing the delimiter to a semicolon, but no luck.
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ( "separatorChar" = ",") stored as textfile
location 'gs://mybucket/myschema.db/mytable/';
Can you please suggest a robust approach so that I don't have to deal with data issues?
Note: currently I don't have the option of connecting my SQL Server table with Sqoop.

Import data into SQL Server using BCP utility (export the log file with the error records and continue inserting with the normal records)

I have a data set and want to import it into my database with these conditions:
If a record cannot be imported, it should be extracted into a log.
Records that cannot be imported should not stop the load; the records that can be imported (the other records) should still be inserted and processing should continue.
Currently I use the BCP utility to import data into the table from the CSV file with:
bcp table_name IN C:\Users\09204086121\Desktop\data.csv -T -c -o C:\Users\09204086121\Desktop\logOut.log -e C:\Users\09204086121\Desktop\errOut.log
This only satisfies condition 1 above.
I need that, when a record has an error (duplicate primary key, ...), it is written to the log (1) and the other normal records continue to be inserted into the table (2).
My idea was to combine a trigger with bcp: after creating a trigger and adding the parameter -h "FIRE_TRIGGERS" to the bcp statement, the insert ignores records that have a duplicate key, but it doesn't write them to the log.
This is my trigger:
ALTER TRIGGER [PKGORDERCOMMON].[T_ImportData] ON [PKGORDERCOMMON].[IF_R_BUNRUI1]
INSTEAD OF INSERT
AS
BEGIN
--Insert non duplicate records
INSERT INTO [IF_R_BUNRUI1]
(
SYSTEM_KB,
BUNRUI1_CD,
BUNRUI1_KANJI_NA,
BUNRUI1_KANA_NA,
CREATE_TS
)
SELECT SYSTEM_KB,
BUNRUI1_CD,
BUNRUI1_KANJI_NA,
BUNRUI1_KANA_NA,
CREATE_TS
FROM inserted i
WHERE NOT EXISTS
(
SELECT *
FROM [IF_R_BUNRUI1] c
WHERE c.BUNRUI1_CD = i.BUNRUI1_CD
AND c.SYSTEM_KB = i.SYSTEM_KB
);
END;
Is there anyone who can help me?
BCP is not meant for what you are asking it to do (separating good and bad records). For instance, bcp's -e option has a limit on how many records it will show. I'm not sure if this limit is tied to the "max errors" option, but regardless, there is a limit.
Your best option is to load all the records and address the bad data in T-SQL.
Load all records in a way that ignores conversion errors. Either:
load the entire line from the file into a single, large varchar column, then parse out the columns and QC the data as needed,
or
load all columns from the source file into generic varchar columns large enough to accommodate your source data.
Either way, when done, use T-SQL to inspect your data and split it into good/bad records, as sketched below.
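For what it's worth, a minimal T-SQL sketch of the second variant; the _STG and _ERR table names, the varchar sizes and the assumption that CREATE_TS is the only non-character column are hypothetical, the duplicate check mirrors the trigger above, and TRY_CONVERT requires SQL Server 2012+:
-- generic varchar staging table that bcp can load without conversion errors
CREATE TABLE PKGORDERCOMMON.IF_R_BUNRUI1_STG (
SYSTEM_KB VARCHAR(100),
BUNRUI1_CD VARCHAR(100),
BUNRUI1_KANJI_NA VARCHAR(400),
BUNRUI1_KANA_NA VARCHAR(400),
CREATE_TS VARCHAR(100)
);
-- rows that convert cleanly and are not duplicates go to the real table
INSERT INTO PKGORDERCOMMON.IF_R_BUNRUI1 (SYSTEM_KB, BUNRUI1_CD, BUNRUI1_KANJI_NA, BUNRUI1_KANA_NA, CREATE_TS)
SELECT s.SYSTEM_KB, s.BUNRUI1_CD, s.BUNRUI1_KANJI_NA, s.BUNRUI1_KANA_NA, TRY_CONVERT(DATETIME2, s.CREATE_TS)
FROM PKGORDERCOMMON.IF_R_BUNRUI1_STG s
WHERE TRY_CONVERT(DATETIME2, s.CREATE_TS) IS NOT NULL
AND NOT EXISTS (SELECT 1 FROM PKGORDERCOMMON.IF_R_BUNRUI1 c WHERE c.BUNRUI1_CD = s.BUNRUI1_CD AND c.SYSTEM_KB = s.SYSTEM_KB);
-- everything else goes to an error table for review
SELECT s.* INTO PKGORDERCOMMON.IF_R_BUNRUI1_ERR
FROM PKGORDERCOMMON.IF_R_BUNRUI1_STG s
WHERE TRY_CONVERT(DATETIME2, s.CREATE_TS) IS NULL
OR EXISTS (SELECT 1 FROM PKGORDERCOMMON.IF_R_BUNRUI1 c WHERE c.BUNRUI1_CD = s.BUNRUI1_CD AND c.SYSTEM_KB = s.SYSTEM_KB);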

SQL Server: Error converting data type varchar to numeric (Strange Behaviour)

I'm working on a legacy system using SQL Server in 2000 compatibility mode. There's a stored procedure that selects from a query into a virtual table.
When I run the query, I get the following error:
Error converting data type varchar to numeric
which initially tells me that something stringy is trying to make its way into a numeric column.
To debug, I created the virtual table as a physical table and started eliminating each column.
The culprit column is called accnum, which stores a bank account number and has a source data type of varchar(21); I'm trying to insert it into a numeric(16,0) column, which obviously could cause issues.
So I made the accnum column varchar(21) as well in the physical table I created, and it imports 100%. I also added an additional column called accnum2 and made it numeric(16,0).
After the data was imported, I updated accnum2 to the value of accnum. Lo and behold, it updates without an error, yet it wouldn't work with an INSERT INTO...SELECT query.
I have to work with the data types provided. Any ideas how I can get around this?
Can you try using a conversion in your insert statement, like this:
SELECT [accnum] = CASE ISNUMERIC(accnum)
WHEN 0 THEN NULL
ELSE CAST(accnum AS NUMERIC(16, 0))
END
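For instance, the same expression can be embedded directly in the insert (the source and target names below are placeholders for your actual tables):
INSERT INTO dbo.target_table (accnum2)
SELECT CASE ISNUMERIC(accnum)
WHEN 0 THEN NULL
ELSE CAST(accnum AS NUMERIC(16, 0))
END
FROM dbo.source_table;
Note that ISNUMERIC also returns 1 for values such as '1e5' or '$12', which would still fail a cast to NUMERIC(16,0), so account numbers containing such characters may need extra filtering.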

Unable to run "INSERT INTO" from Azure SQL external table

In my Azure SQL DB I have an external table - let's call this tableName_origData - and I have another table which we'll refer to as tableName.
tableName was created using a generated CREATE script from tableName_origData (in its original location) so I can be sure that all the column types are identical.
However, when I run
INSERT INTO tableName (
[list of column names]
)
SELECT
[same list of column names]
FROM
tableName_origData
I encounter the following exception:
Large object column support is limited to only nvarchar(max) data type.
As far as my understanding of Azure SQL's data types goes, I don't have anything larger than NVARCHAR(MAX). Furthermore, the message implies that NVARCHAR(MAX) is supported (and I can see that the same script works on other tables which contain NVARCHAR(MAX)).
Can anyone better explain the cause of this exception and what I might need to do in order to insert its data into an identical table?
Here is a list of all the column types used in the table(s):
BIGINT x 3
NCHAR(20) x 1
NVARCHAR(45) x 5
NVARCHAR(100) x 14
NVARCHAR(MAX) x 10
External tables are read-only. The developer can select data, but cannot perform any form of DML processing.
To solve this issue, please use this technique:
https://technology.amis.nl/2005/04/05/updateable-external-tables/
Warning: unless for the simplest of uses, we do not recommend using this technique for any serious application.

Loading data of one table into another residing on different databases - Netezza

I have a big file which I have loaded into a table in a Netezza database using an ETL tool; let's call this database Staging_DB. Now, after some verifications, the content of this table needs to be inserted into a similarly structured table residing in another Netezza DB; let's call this one PROD_DB. What is the fastest way to transfer the data from Staging_DB to PROD_DB?
Should I be using the ETL tool to load the data into PROD_DB? Or
should the transfer be done using the external tables concept?
If there is no transformation to be done, then the better way to transfer is a cross-database data transfer. As described in the Netezza documentation, Netezza supports cross-database access where the user has object-level permission on both databases.
You can check permissions with the following command:
dbname.schemaname(loggedin_username)=> \dpu username
Please find a working example below:
INSERT INTO Staging_DB..TBL1 SELECT * FROM PROD_DB..TBL1
If you want to do some transformation and only then insert into the other database, you can write UDT procedures (also called resultset procedures).
Hope this will help.
One way you could move the data is by using Transient External Tables. Start by creating a flat file from your source table/db. Because you are moving from Netezza to Netezza you can save time and space by turning on compression and using internal formatting.
CREATE EXTERNAL TABLE 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' ) AS
SELECT * FROM source_table;
Then create the table in your target database using the same DDL as the source and just load it up.
INSERT INTO target SELECT * FROM external 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' );
I would write a stored procedure on the production DB and do a CTAS from the staging to the production database. The beauty of an SP is that you can add transformations as well.
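For example, when connected to the production database, the CTAS itself could look like this (table names mirror the ones above and are only illustrative):
-- run while connected to PROD_DB; cross-database reads are allowed
CREATE TABLE TBL1 AS
SELECT * FROM Staging_DB..TBL1;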
Another option is the nz_migrate utility provided by Netezza, which I believe is the fastest route.
A simple SQL query like
INSERT INTO Staging_DB..TBL1 SELECT * FROM PROD_DB..TBL1
works great if you just need to do that.
Just be aware that you have to be connected to the destination database when executing the query, otherwise you will get an error code
HY0000: "Cross Database Access not supported for this type of command"
even if you have read/write access to both databases and tables.
In most cases you can simply change the catalog using a "Set Catalog" command
https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_set_catalog.html
set catalog='database_name';
insert into target_db.target_schema.target_table select * from source_db.source_schema.source_table;
