I am processing a large 120 GB file using hive. Data is first loaded from sql server table to aws s3 as csv file (tab separated) and then hive external table is created on top of this file. I have encountered a problem while querying data from hive external table. I noticed that csv contains \n in many columns fields (which was actually “null” in sql server). Now when I create hive table the \n that appears in any record takes hive to new record and generate NULL for rest of the columns in that record. I tried lines terminated by "001" but no success. I get error that hive only supports only "lines terminated by \n". My question is if hive supports only \n as line separator how would you handle columns that contains \n values?
Any suggestions?
This is how I am creating my external table:
DROP TABLE IF EXISTS IMPT_OMNITURE__Browser;
CREATE EXTERNAL TABLE IMPT_OMNITURE__Browser (
ID int, Region string, Description string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://abm-dw/data-import/omniture/Browser/';
You could alter the table with the below command or add the property in the create statement in the TBL properties ;
ALTER TABLE table set SERDEPROPERTIES ('serialization.null.format' = "");
This would make the data in the file as NULL.
Related
I am migrating my data from SQL Server to Hive using following steps but there is data issue with the resulting table. I tried various options including checking datatype, Using csvSerde but not able to get data aligned properly in respective columns. I followed following steps:
Export SQL Server data to flat file with fields separated by comma.
Create external table in Hive as given below and load data.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable (
r_date timestamp
, v_nbr varchar(12)
, d_account int
, d_amount decimal(19,4)
, a_account varchar(14)
)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable/mytable.txt' OVERWRITE INTO TABLE myschema.mytable;
There is issue with data with all combination I could try.
I also tried OpenCSVSerde but the result was worse than simple text file. I also tried by changing delimiter to semicolon but no luck.
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ( "separatorChar" = ",") stored as textfile
location 'gs://mybucket/myschema.db/mytable/';
Can you please suggest some robust approach so that I don't have to deal with data issue.
Note: Currently I don't have option of connecting my SQL Server table with Sqoop.
I often want to quickly load a CSV into an Oracle database. The CSV (Unicode) is on a machine with an Oracle InstantClient version 19.5, the Oracle database is of version 18c.
I look for a command line tool which uploads the rows without me specifying a column structure.
I know I can use sqlldr with a .ctl file, but then I need to define columns types, etc. I am interested in a tool which figures out the column attributes itself from the data in the CSV (or uses a generic default for all columns).
The CSVs I have to ingest contain always a header row the tool in question could use to determine appropriate columns in the table.
Starting with Oracle 12c, you can use sqlldr in express mode, thereby you don't need any control file.
In Oracle Database 12c onwards, SQLLoader has a new feature called
express mode that makes loading CSV files faster and easier. With
express mode, there is no need to write a control file for most CSV
files you load. Instead, you can load the CSV file with just a few
parameters on the SQLLoader command line.
An example
Imagine I have a table like this
CREATE TABLE EMP
(EMPNO number(4) not null,
ENAME varchar2(10),
HIREDATE date,
DEPTNO number(2));
Then a csv file that looks like this
7782,Clark,09-Jun-81,10
7839,King,17-Nov-81,12
I can use sqlldr in express mode :
sqlldr userid=xxx table=emp
You can read more about express mode in this white paper
Express Mode in SQLLDR
Forget about using sqlldr in a script file. Your best bet is on using an external table. This is a create table statement with sqlldr commands that will read a file from a directory and store it as a table. Super easy, really convenient.
Here is an example:
create table thisTable (
"field1" varchar2(10)
,"field2" varchar2(100)
,"field3" varchar2(100)
,"dateField" date
) organization external (
type oracle_loader
default directory <createDirectoryWithYourPath>
access parameters (
records delimited by newline
load when (fieldname != BLANK)
skip 9
fields terminated by ',' optionally ENCLOSED BY '"' ltrim
missing field values are null
(
"field1"
,"field2"
,"field3"
,"dateField" date 'mm/dd/yyyy'
)
)
location ('filename.csv')
);
I created database using SQL in hive.
And I looked for the database using HDFS.
But I couldn't find database in HDFS.
In hive:
CREATE DATABASE practice
LOCATION '/user/hive/warehouse'/
Checking:
hdfs dfs -ls /user/hive/warehouse
There is nothing in warehouse.
In addition, I created a table in a specific database in hive.
But, using Hue, I could see the table in the default location.
I wanna insert the table into a specific database location.
CREATE TABLE prac (
id INT,
title STRING,
salary INT,
posted TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/practice.db/prac';
I couldn't find the table prac in the database practice in Hue and HDFS.
How can I see the database in HDFS?
And I also wanna know how to see the table in the specific database location.
Try by specifying db name while creating hive table prac by default hive creates tables in default database.
Example:
hive> CREATE DATABASE practice LOCATION '/user/hive/warehouse/practice.db';
hive> CREATE TABLE `practice.prac` ( id INT, title STRING, salary INT, posted TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/warehouse/practice.db/prac';
Try using below command:
CREATE DATABASE practice
LOCATION '/user/hive/warehouse/practice.db';
hive by default uses '/user/hive/warehouse/' directory to create databases under this location. So while creating database, If you don't provide the location it will pick the database location like this '/user/hive/warehouse/practice.db'.
You can choose any location over hdfs until you have read and write permission over that location.
I have a very flat, simple log file (6 rows of which one row is blank) that I want to insert into a simple 5 column SQL Server table.
Please excuse my SQL ignorance as my knowledge around this topic is not educated.
Below is the .log file content :-
-----------Log File content start----------
07/30/2016 00:02:03 : BATCH CLOSE SUMMARY
MerchantID - 000022673665
TerminalID - 013
BatchItemCount - 650
NetBatchTotal - 5095.00
----------Log file content end-------------
Below is the simple SQL Server table layout:
CREATE TABLE dbo.CCClose
(
CloseTime NVARCHAR(50) NOT NULL,
MercID NVARCHAR(50) NOT NULL,
TermID NVARCHAR(50) NOT NULL,
BatchCount NVARCHAR(30) NOT NULL,
NetBatcTotal NVARCHAR(50) NOT NULL
);
I'm hoping that somehow have each row looked at by SQL for example:
if .log file like 'Batch close Summary' then insert into CloseTime else
if .log file like 'MerchantID' then insert into MercID else
if .log file like 'BatchItemCount' then insert into BatchCount else
if .log file like 'NetBatchTotal' then insert into NetBatchTotal
Off course it would be great if the proper formatting for each column was in place but at this time I just looking at getting the .log file data populated from a directory of these logs.
I plan to use Crystal Reports to build on the SQL Server tables.
This is not going to be a simple process. You can probably do it with bulk insert. The idea is to read it into a staging table, using:
a record terminator of something like "----------Log file content end-------------" + newline
a field separator of a newline
a staging table with several columns of varchars
Then process the staging table to extract the values (and types) that you want. There are probably other options, if you set up a format file, but that adds another level of complexity.
I would read the table into a staging table with one line per row in the table. Then, I would:
use window functions to assign a record number to rows, based on the "content start" lines
aggregate based on the record number
extract the values using aggregations, string functions, and conversions
In SQL Data Warehouse (editors please don't change this, it is the actual name see: here) I have a JobCandidate_ext external table that looks like this.
CREATE EXTERNAL TABLE [HumanResources].[JobCandidate_ext](
[JobCandidateID] int,
[BusinessEntityID] int,
[Resume] Varchar(8000),
[ModifiedDate] Datetime
)
WITH (
LOCATION='/[HumanResources].[JobCandidate]/data.txt',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile)
GO
The column [Resume] was an XML type in SQL Server but in SQL Data Warehouse XML types should be converted to varchar(8000) as described here.
I am using a flat file data.txt to export the data to a blob and then create an external table from it.
The [Resume] column has carriage returns in it (as expected from an XML file), and so when you run a SELECT * FROM [HumanResources].[JobCandidate_ext] you get an error. In this case:
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 2 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 0, Expected data type: INT, Offending value: some text .... (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.
I know that I cannot configure a row delimiter when creating external tables as described here.
The row delimiter must be UTF-8 and supported by Hadoop’s LineRecordReader. The row delimiter must be either '\r', '\n', or '\r\n'. These are not user-configurable.
And if you try to put quotes on each column field you get this error while selecting rows from the external table: No closing string delimiter.
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 2, Expected data type: VARCHAR(8000) collate SQL_Latin1_General_CP1_CI_AS, Offending value: 'ShaiBassli (Tokenization failed), Error: No closing string delimiter.
Is there a way to get around this issue?
Today, PolyBase does not allow for row or field delimiters inside fields i.e. it does not allow you to escape these characters. As Greg pointed out, you can vote for this functionality here: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-line-ends-within-qualified-text-f
To workaround this limitation, you can either pre-process the data (using sed or tr for example) to replace unwanted characters before reading it with PolyBase. Or you can switch to other polybase supported file formats RCFile/ORC/Parquet to avoid dealing with row and field delimiters completely.