I am migrating my data from SQL Server to Hive using the following steps, but there is a data issue with the resulting table. I tried various options, including checking the data types and using CSVSerde, but I was not able to get the data aligned properly in the respective columns. I followed these steps:
Export the SQL Server data to a flat file with fields separated by commas.
Create an external table in Hive as given below and load the data.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable (
r_date timestamp
, v_nbr varchar(12)
, d_account int
, d_amount decimal(19,4)
, a_account varchar(14)
)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable/mytable.txt' OVERWRITE INTO TABLE myschema.mytable;
There is an issue with the data in every combination I could try.
I also tried OpenCSVSerde, but the result was worse than with the plain text file. I also tried changing the delimiter to a semicolon, but no luck.
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ( "separatorChar" = ",") stored as textfile
location 'gs://mybucket/myschema.db/mytable/';
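For reference, a complete sketch of that OpenCSVSerde variant with explicit quote and escape characters (mytable_raw is a made-up name here; note that OpenCSVSerde reads every column as STRING, so the typed table has to be populated with explicit casts afterwards):

CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable_raw (
r_date string
, v_nbr string
, d_account string
, d_amount string
, a_account string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
stored as textfile
location 'gs://mybucket/myschema.db/mytable/';

-- cast the string columns into the typed table
INSERT OVERWRITE TABLE myschema.mytable
SELECT CAST(r_date AS timestamp)
, CAST(v_nbr AS varchar(12))
, CAST(d_account AS int)
, CAST(d_amount AS decimal(19,4))
, CAST(a_account AS varchar(14))
FROM myschema.mytable_raw;

This only helps if the export actually quotes fields that contain the delimiter; unquoted embedded commas or newlines in the source data cannot be recovered by any delimiter choice and are the usual cause of misaligned columns.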
Can you please suggest a robust approach so that I don't have to deal with data issues?
Note: Currently I don't have the option of connecting my SQL Server table with Sqoop.
I'm exporting data into Parquet files and importing them into Snowflake. The export is done with Python (using to_parquet from pandas) on a Windows Server machine.
The exported file has several timestamp columns. Here's the metadata of one of these columns (ParquetViewer):
I'm having weird issues trying to import the timestamp columns into Snowflake.
Attempt 1 (using the copy into):
create or replace table STAGING.DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0),
"ExitDate" TIMESTAMP_NTZ(9)
);
copy into STAGING.DIM_EMPLOYEE
from @S3
pattern='dim_Employee_.*.parquet'
file_format = (type = parquet)
match_by_column_name = case_insensitive;
select * from STAGING.DIM_EMPLOYEE;
The timestamp column is not imported correctly:
It seems that Snowflake assumes the value in the column is in seconds rather than microseconds, and therefore converts it incorrectly.
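For comparison, the same explicit microsecond scale could in principle be applied directly in a COPY transformation instead of an external table. A rough sketch, reusing the stage and column names above (and assuming PATTERN is accepted together with a transformation):

copy into STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate")
from (
    select
        $1:"EmployeeID"::number(38,0),
        to_timestamp_ntz($1:"ExitDate"::number, 6)   -- treat the raw value as microseconds
    from @S3
)
pattern = 'dim_Employee_.*.parquet'
file_format = (type = parquet);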
Attempt 2 (using the external tables):
Then I created an external table:
create or replace external table STAGING.EXT_DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (CAST(GET($1, 'ExitDate') AS TIMESTAMP_NTZ(9)))
)
location=@S3
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE;
The data is still incorrect - still the same issue (seconds instead of microseconds):
Attempt 3 (using the external tables, with modified TO_TIMESTAMP):
I've then modified the external table definition to specify explicitly that microseconds are used, via TO_TIMESTAMP_NTZ with scale parameter 6:
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V2(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (TO_TIMESTAMP_NTZ(TO_NUMBER(GET($1, 'ExitDate')), 6))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE_V2;
Now the data is correct:
But now the "weird" issue appears:
I can load the data into a table, but the load is quite slow and I get a Querying (repair) message during the load. However, at the end, the query is executed, albeit slowly:
I want to load the data from a stored procedure, using a SQL script. When executing the statement using EXECUTE IMMEDIATE, an error is returned:
DECLARE
SQL STRING;
BEGIN
SET SQL := 'INSERT INTO STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate") SELECT "EmployeeID", "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V2;';
EXECUTE IMMEDIATE :SQL;
END;
I have also tried defining the timestamp column in the external table as a NUMBER, importing it, and converting it into a timestamp later. This produces the same issue (returning "SQL execution internal error" in the SQL script).
Has anyone experienced an issue like this? It seems like a bug to me.
Basically, my goal is to generate insert/select statements dynamically and execute them (in stored procedures). I have a lot of files (with different schemas) that need to be imported, and I want to create a universal logic to load these Parquet files into Snowflake.
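As a rough illustration of that goal, a Snowflake Scripting procedure that builds and runs such a statement dynamically might look like the sketch below (the procedure name and parameters are made up, and it assumes the underlying timestamp issue is resolved):

CREATE OR REPLACE PROCEDURE STAGING.LOAD_FROM_EXTERNAL(src_table STRING, tgt_table STRING, column_list STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
DECLARE
    stmt STRING;
BEGIN
    -- build the INSERT ... SELECT dynamically from the passed-in names
    stmt := 'INSERT INTO ' || tgt_table || ' (' || column_list || ') ' ||
            'SELECT ' || column_list || ' FROM ' || src_table;
    EXECUTE IMMEDIATE stmt;
    RETURN 'Loaded ' || tgt_table;
END;
$$;

-- example call
CALL STAGING.LOAD_FROM_EXTERNAL(
    'STAGING.EXT_DIM_EMPLOYEE_V2',
    'STAGING.DIM_EMPLOYEE',
    '"EmployeeID", "ExitDate"');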
As confirmed in the Snowflake Support ticket you opened, this issue got resolved when the Snowflake Support team enabled an internal configuration for Parquet timestamp logical types.
If anyone encounters a similar issue please submit a Snowflake Support ticket.
I often want to quickly load a CSV into an Oracle database. The CSV (Unicode) is on a machine with Oracle Instant Client version 19.5; the Oracle database is version 18c.
I am looking for a command-line tool that uploads the rows without me having to specify a column structure.
I know I can use sqlldr with a .ctl file, but then I need to define column types, etc. I am interested in a tool that figures out the column attributes itself from the data in the CSV (or uses a generic default for all columns).
The CSVs I have to ingest always contain a header row, which the tool in question could use to determine appropriate columns for the table.
Starting with Oracle 12c, you can use sqlldr in express mode, so you don't need a control file.
In Oracle Database 12c onwards, SQL*Loader has a new feature called express mode that makes loading CSV files faster and easier. With express mode, there is no need to write a control file for most CSV files you load. Instead, you can load the CSV file with just a few parameters on the SQL*Loader command line.
An example
Imagine I have a table like this:
CREATE TABLE EMP
(EMPNO number(4) not null,
ENAME varchar2(10),
HIREDATE date,
DEPTNO number(2));
Then a CSV file that looks like this:
7782,Clark,09-Jun-81,10
7839,King,17-Nov-81,12
I can use sqlldr in express mode:
sqlldr userid=xxx table=emp
You can read more about express mode in this white paper
Express Mode in SQLLDR
Forget about scripting sqlldr. Your best bet is to use an external table: a CREATE TABLE statement with SQL*Loader-style access parameters that reads a file from a directory and presents it as a table. Super easy, really convenient.
Here is an example:
create table thisTable (
"field1" varchar2(10)
,"field2" varchar2(100)
,"field3" varchar2(100)
,"dateField" date
) organization external (
type oracle_loader
default directory <createDirectoryWithYourPath>
access parameters (
records delimited by newline
load when ("field1" != BLANKS)
skip 9
fields terminated by ',' optionally ENCLOSED BY '"' ltrim
missing field values are null
(
"field1"
,"field2"
,"field3"
,"dateField" date 'mm/dd/yyyy'
)
)
location ('filename.csv')
);
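Once created, the external table can be queried like any other table and, if a permanent copy is needed, loaded with a plain INSERT; targetTable below is just a placeholder:

-- read the CSV directly through the external table
select * from thisTable;

-- optionally materialize it into a regular table (targetTable is hypothetical)
insert into targetTable ("field1", "field2", "field3", "dateField")
select "field1", "field2", "field3", "dateField"
from thisTable;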
I am using Microsoft SQL Server Management Studio and I am currently importing some CSV files into a database. I am importing the CSV files into already existing tables using the BULK INSERT command, with the following query:
BULK INSERT myTable
FROM 'D:\myfolder\file.csv'
WITH
(FIRSTROW = 2,
FIELDTERMINATOR = ';', --CSV Field Delimiter
ROWTERMINATOR = '\n', -- Used to shift to the next row
ERRORFILE = 'D:\myfolder\Error Files\myErrrorFile.csv',
TABLOCK
)
This works fine for me so far, but I would like to automate the process of naming columns in tables. More specifically, I would like to create a table and use the contents of the first row of the CSV file as the column names. Is that possible?
The easiest way I can think of is:
right-click on the database, select: Tasks -> Import Data...
After that, the SQL Server Import and Export Wizard will open. There you can specify and customize all settings for importing data from various sources (such as getting column names from the first row of a file).
In your case, your data source will be a Flat File Source.
I am processing a large 120 GB file using Hive. The data is first exported from a SQL Server table to AWS S3 as a CSV file (tab-separated), and then a Hive external table is created on top of this file. I have encountered a problem while querying data from the Hive external table. I noticed that the CSV contains \n in many column fields (which were actually NULL in SQL Server). Now, when I create the Hive table, any \n that appears in a record takes Hive to a new record and generates NULL for the rest of the columns in that record. I tried lines terminated by "001" but had no success; I get an error that Hive only supports "lines terminated by \n". My question is: if Hive supports only \n as the line separator, how would you handle columns that contain \n values?
Any suggestions?
This is how I am creating my external table:
DROP TABLE IF EXISTS IMPT_OMNITURE__Browser;
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/';
You could alter the table with the command below, or add the property in the TBLPROPERTIES clause of the CREATE statement;
ALTER TABLE IMPT_OMNITURE__Browser SET SERDEPROPERTIES ('serialization.null.format' = '');
This makes empty values in the file be treated as NULL.
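For example, folded into the CREATE statement from the question (a sketch that keeps the original columns and location):

CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/'
TBLPROPERTIES ('serialization.null.format' = '');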
Table name: sample
Table structure:
ID int
NAME varchar(30)
IPADDRESS varbinary(16)
MySQL query:
load data concurrent local infile 'C:\test.txt' into table sample fields terminated by ',' LINES TERMINATED BY '\r\n' (ID,NAME,@var3) set IPADDRESS = inet_pton(@var3)
SQL Server equivalent query:
??
Using bcp would be appreciated.
Thanks in advance.
Here is an article which you will find useful:
How do I load text or csv file data into SQL Server?
It was the second result from Google when searching for "bcp load file"
EDIT:
You might be able to do your import in two steps: load the rows from the file into a temp table, then apply a function to convert the IP strings to the binary format.
Have a look at this question on SO: Datatype for storing ip address in SQL Server
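For IPv4 addresses, a rough T-SQL sketch of that two-step approach might look like the following; the staging table and the PARSENAME-based conversion are illustrative assumptions, and IPv6 values stored in varbinary(16) would need a different conversion (e.g. a user-defined or CLR function):

-- 1) stage the raw rows with the IP kept as text
CREATE TABLE #sample_stage (
    ID      int,
    NAME    varchar(30),
    IP_TEXT varchar(45)
);

BULK INSERT #sample_stage
FROM 'C:\test.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\r\n');

-- 2) convert dotted IPv4 text to binary while copying into the real table
--    (PARSENAME part 4 is the first octet, part 1 the last)
INSERT INTO sample (ID, NAME, IPADDRESS)
SELECT ID,
       NAME,
       CAST(CAST(PARSENAME(IP_TEXT, 4) AS tinyint) AS binary(1))
     + CAST(CAST(PARSENAME(IP_TEXT, 3) AS tinyint) AS binary(1))
     + CAST(CAST(PARSENAME(IP_TEXT, 2) AS tinyint) AS binary(1))
     + CAST(CAST(PARSENAME(IP_TEXT, 1) AS tinyint) AS binary(1))
FROM #sample_stage;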