Unable to load data into Hive with a custom delimiter

I am trying to learn Hive; this may be a stupid question.
I created a table in Hive as follows:
create table if not exists tweets_table(
tweetdata STRING,
followerscount INT,
friendscount INT,
statuscount INT,
retweetcount INT,
favouritescount INT,
lang STRING,
placefullname STRING,
placename STRING,
countryname STRING,
countrycode STRING,
hashtags STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^**^'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/TestDB.txt' INTO TABLE tweets_table;
I used '^**^' as the delimiter because the tweets contain a lot of \n, \r, and , characters (please suggest a standard practice if there is one).
So I have a text file which I am trying to load:
09-09-2016 10:51:33|^**^|#ArvindKejriwal #abpnewstv तुम्हारे दावों का क्या हुआ केजरीवाल।|^**^|74|^**^|30|^**^|0|^**^|98|^**^|0|^**^|49|^**^|en|^**^|Ambikapur, India|^**^|Ambikapur|^**^|India|^**^|IN|^**^|[]
09-09-2016 10:51:37|^**^|#LiveLawIndia It is shocking a judge arrested. I am sure Higher Judiciary will come their rescue , Judges per se cannot be wrong|^**^|0|^**^|14|^**^|0|^**^|32|^**^|0|^**^|2|^**^|en|^**^|Rajasthan, India|^**^|Rajasthan|^**^|India|^**^|IN|^**^|[]
After successfully loading it and querying it, I get the following output:
09-09-2016 10:51:33| NULL NULL NULL NULL NULL |30| ** |0| ** |98| **
09-09-2016 10:51:37| NULL NULL NULL NULL NULL |14| ** |0| ** |32| **
I fail to understand where I am going wrong: is it in my text file or in the Hive table definition? Please help.

There are several issues with what you're trying to do:
Using FIELDS TERMINATED BY, you can't have a delimiter that is more than one character; the default SerDe only supports single-character delimiters.
Even if that worked, it wouldn't solve the problem that your tweets contain line delimiters: each \n inside a tweet starts a new row.
The way you describe your table, it's impossible to parse correctly; you can't use \n as the line delimiter while it also appears inside the tweet data. If you are the one generating this input file, I would suggest replacing all \n and \r characters in the tweets with spaces, as sketched below.
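If you do control the file generation, here is a minimal sketch of one common convention (an assumption, not the only standard practice): strip \n and \r from each tweet, then use Hive's default single-character field delimiter \001 (Ctrl-A), which almost never occurs in real text:
create table if not exists tweets_table(
tweetdata STRING,
followerscount INT,
friendscount INT,
statuscount INT,
retweetcount INT,
favouritescount INT,
lang STRING,
placefullname STRING,
placename STRING,
countryname STRING,
countrycode STRING,
hashtags STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;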

Create the table with a regex SerDe instead of the default Hive SerDe.
Modify the regex below based on the number of columns (note: each of your sample rows contains 14 '^**^'-separated fields, while the table defines 12 columns; the number of capture groups in the regex must equal the number of columns, so align them before loading):
^(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)$
Table:
create external table if not exists tweets_table(
tweetdata STRING,
followerscount INT,
friendscount INT,
statuscount INT,
retweetcount INT,
favouritescount INT,
lang STRING,
placefullname STRING,
placename STRING,
countryname STRING,
countrycode STRING,
hashtags STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = " ^(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)$",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s"
)
STORED AS TEXTFILE;
Load data:
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/TestDB.txt' INTO TABLE tweets_table ;
If you get a RegexSerDe ClassNotFoundException, add the regex SerDe jar:
ADD JAR hive-contrib-x.x.x.jar;
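As a quick sanity check after loading (a hedged example; the columns come from the table above):
SELECT tweetdata, followerscount, lang
FROM tweets_table
LIMIT 2;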

Related

String delimiter present in string not permitted in Polybase?

I'm creating an external table using a CSV stored in an Azure Data Lake Storage and populating the table using Polybase in SQL Server.
However, I ran into the problem below, and figured it may be due to the fact that one particular column contains double quotes within the string, while the string delimiter has been specified as " in Polybase (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
I have done quite extensive research on this and found that the issue has been around for years, yet I have not seen any solutions given.
Any help will be appreciated.
I think the easiest way to fix this, because you are in charge of the .csv creation, is to use a delimiter which is not a comma and leave off the string delimiter. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it is imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO
IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO
CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
--STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE delimiterWorking (
id INT NOT NULL,
body VARCHAR(8000) NULL
)
WITH (
LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_delimiterWorking,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
GO
SELECT *
FROM delimiterWorking
GO
-- Fix up the data
CREATE TABLE cleanedData
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
id,
body AS originalCol,
SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody
FROM delimiterWorking
GO
SELECT *
FROM cleanedData
The string delimiter issue can be avoided if you have the data lake flat file converted to Parquet format.
Input:
"ID"
"NAME"
"COMMENTS"
"1"
"DAVE"
"Hi "I am Dave" from"
"2"
"AARO"
"AARO"
Steps:
1. Convert the flat file to Parquet format [using Azure Data Factory]
2. Create the external file format in the data lake [assuming the master key and scope credentials are available]:
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create the external table with FILE_FORMAT = PARQUET_CONV, as sketched below.
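A minimal sketch of step 3, assuming the ds_azureDataLakeStore data source from the earlier answer and hypothetical location and column definitions:
CREATE EXTERNAL TABLE parquetConverted (
id INT NULL,
name VARCHAR(100) NULL,
comments VARCHAR(8000) NULL
)
WITH (
LOCATION = 'yourLake/someFolder/convertedParquet',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = PARQUET_CONV
);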
I believe this is the best option, as Microsoft currently doesn't have a solution for handling a string delimiter occurring within the data for external tables.

Loading data to Snowflake target: End of record reached while expected to parse column

Error observed:
End of record reached while expected to parse column '"DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]' File 'DIM_EQUIPMENT_UNIT_issue_nulll_end.csv', line 17, character 133 Row 16, column "DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
For more information on loading options, please run 'info loading_data' in a SQL client.
Sample Data:
3499933,00010101,99991231,"Y","TSXZ 622095",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499948,00010101,99991231,"Y","EOLU 8888370",0,1,28,-1,3,-2,"1","2018-04-26-17.35.47.000000","COMPASS ",5799,-2,-2,"n/a ","n/a",+3.69000000000000E+000
3499968,00010101,99991231,"Y","NONZ 7086849",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499992,00010101,99991231,"Y","SGPU 1240279",0,1,31,-1,3,-2,"1","2019-05-22-17.29.11.000000","COMPASS ",6203,-2,-2,"n/a ","n/a",+3.05000000000000E+000
109267,00010101,99991231,"Y","CTSU 425ß85 ",0,1,46,-1,3,-2,"1","2011-05-16-08.52.08.000000","COMPASS ",98,-2,-2,"n/a ","n/a",
DDL:
CREATE OR REPLACE TABLE DIM_EQUIPMENT_UNIT(
EQUNIT_ID NUMBER,
EQUNIT_VLD_FROM DATE,
EQUNIT_VLD_TO DATE,
EQUNIT_VLD_FLAG VARCHAR(1),
EQUNIT_UNIT_NUMBER VARCHAR(13),
EQUNIT_CONSTRUCTION_YEAR NUMBER,
FK_TW2130EQCAT NUMBER,
FK_TW0020EQT NUMBER,
FK_TW2160EQSERIES NUMBER,
FK_TW0050OWS NUMBER,
FK_TW059VEQLESSOR NUMBER,
EQUNIT_CLIENT VARCHAR(1),
EQUNIT_LC TIMESTAMP_NTZ,
EQUNIT_CB VARCHAR(8),
EQUNIT_LOAD_CYCLE_ID NUMBER,
FK_TW0820CHT NUMBER,
FK_TW0850GST NUMBER,
EQUNIT_SAP_ASSET_NUMBER VARCHAR(11),
EQUNIT_PRE_INTEGRATION_OWNER VARCHAR(3),
EQUNIT_TARE_WEIGHT_TNE FLOAT
);
COPY Command Used:
COPY INTO "DIM_EQUIPMENT_UNIT" FROM '#SNOWFLAKE_STAGE/' FILES=('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv') on_error='abort_statement'
file_format=(type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"' ESCAPE_UNENCLOSED_FIELD = None ENCODING ='WINDOWS1252'
EMPTY_FIELD_AS_NULL=true NULL_IF = ('NULL', 'null', '\N') TIMESTAMP_FORMAT='YYYY-MM-DD-HH24.MI.SS.FF')
If you look at the last record in your sample data, the 18th and 19th column values are "n/a", and there is nothing in column number 20. Even if it is supposed to be NULL, it should show up in the data as "" or "\N" or NULL.
Since there is nothing there at all, you get the end-of-record error on that column.
Now, you can do either of the following two things:
a. Make sure your sample file has exactly 20 columns, or
b. If you cannot do that, and you are OK with ignoring the row, change ON_ERROR in the COPY statement to on_error='continue', as shown below.
This will ignore the row and move forward.
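For option (b), this is the COPY statement from the question with only the ON_ERROR option changed:
COPY INTO "DIM_EQUIPMENT_UNIT" FROM @SNOWFLAKE_STAGE/ FILES=('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv') on_error='continue'
file_format=(type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"' ESCAPE_UNENCLOSED_FIELD = None ENCODING ='WINDOWS1252'
EMPTY_FIELD_AS_NULL=true NULL_IF = ('NULL', 'null', '\N') TIMESTAMP_FORMAT='YYYY-MM-DD-HH24.MI.SS.FF')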

Find a number in a JSON array value with regex

I want to match a string in a JSON string that looks like this:
"ids":[44,53,1,3,12,45]
I want to run a query in SQLite, sending only one number as the id, and match one of the above ids in the SQL statement.
I wrote the regex "ids":[\[] to match the start of the key,
but I have no idea how to match an id in the middle and skip the ids before it.
Example:
I have a calc_method table like this:
CREATE TABLE "calc_method" (
"calc_method_id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
"calc_method_name" TEXT NOT NULL,
"calc_method_value" TEXT NOT NULL
);
In the calc_method_value column I store a calcMethod class which is converted to JSON using Gson:
class calcMethod(
var memberCafeIds: ArrayList<Long>,
var memberBarIds: ArrayList<Long>
)
After I convert calcMethod to JSON, I get output like the example below, and this value is stored in the calc_method_value column:
{"memberCafeIds":[1,2,14,5,44],"memberBarIds":[23,1,5,78]}
Now I want to select the rows that match my regex pattern, e.g. rows whose calc_method_value column has memberCafeIds containing id 1:
SELECT * FROM calc_method WHERE calc_method_value REGEXP '"memberCafeIds":\[[:paramId]'
:paramId is a method parameter.
Regards, a programmer struggling with regex.
In SQLite, use the JSON1 functions to work with JSON, not regular expressions. In particular, use json_each() to turn the JSON array into a table you can query:
sqlite> CREATE TABLE ex(json);
sqlite> INSERT INTO ex VALUES ('{"ids":[44,53,1,3,12,45]}');
sqlite> SELECT * FROM ex WHERE 1 IN (SELECT value FROM json_each(ex.json, '$.ids'));
json
-------------------------
{"ids":[44,53,1,3,12,45]}
sqlite> SELECT * FROM ex WHERE 50 IN (SELECT value FROM json_each(ex.json, '$.ids'));
sqlite>
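Applied to the calc_method table from the question, the literal id becomes the bound parameter :paramId (a sketch; the JSON path assumes you are matching against the memberCafeIds key):
SELECT *
FROM calc_method
WHERE :paramId IN (SELECT value
FROM json_each(calc_method.calc_method_value, '$.memberCafeIds'));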

SSIS - remove character X unless it's followed by character Y

Let's say I have the following dataset imported from a text file:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill "The man" Taylor","France"
I am looking for a way to remove every " in the data, unless it's followed or preceded by a ,.
So in my case, the data should become:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill The man Taylor","France"
I tried it with the import text file component in SSIS, but that gives an error when I set the column delimiter to ". If I don't set a delimiter, it sees the comma in "Texas, USA" as a split delimiter.
Any suggestions/ideas? The text file is too large to change this manually for every line, so that's not an option.
Bit of a cop-out on the last '"', but:
Create table #test ([Data] nvarchar(max))
insert into #test values ('1,"John Davis","Germany"' )
insert into #test values ('2,"Mike Johnson","Texas, USA"' )
insert into #test values ('3,"Bill "The man" Taylor","France"')
-- Swap delimiter-adjacent quotes (," and ",) to the placeholder ~,
-- strip every remaining ", then turn the placeholders back into ".
-- The trailing + '"' restores the quote stripped from the end of each line.
select replace(replace(replace(replace([Data],',"',',~'), '",','~,'),'"', ''),'~','"') + '"'
from #test

Loading an array of strings in Hive

I am trying to load the following data into a Hive table.
(screenshot of the data being loaded)
I used the following table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS YOUTUBE_DATA (
VIDEO_ID STRING,
UPLOADER STRING,
INTERVAL INT,
CATEGORY STRING,
VIDEO_LEN INT,
VIEW_NO INT,
RATING FLOAT,
NO_COMMENTS INT,
RELATED_VIDEOS ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
location '/five';
LOAD DATA INPATH '/DATA/YOUTUBEDATA' OVERWRITE INTO TABLE YOUTUBE_DATA;
On running the query select * from youtube_data limit 10;
the following was the output:
(screenshot of the query output)
Can anyone please help with the mistake I am making, and with the solution?
