Loading an array of strings in Hive

I am trying to load the following data into a Hive table:
[screenshot of the input data in the original post]
I used the following table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS YOUTUBE_DATA (
VIDEO_ID STRING,
UPLOADER STRING,
INTERVAL INT,
CATEGORY STRING,
VIDEO_LEN INT,
VIEW_NO INT,
RATING FLOAT,
NO_COMMENTS INT,
RELATED_VIDEOS ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
location '/five';
LOAD DATA INPATH '/DATA/YOUTUBEDATA' OVERWRITE INTO TABLE YOUTUBE_DATA;
On running the query select * from youtube_data limit 10;, the following was the output:
[screenshot of the query output in the original post]
Can anyone please point out the mistake I am making and suggest a solution?
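One thing that stands out in the definition above: FIELDS and COLLECTION ITEMS are both terminated by '\t', so Hive cannot tell where the scalar columns end and the array elements begin. Below is a minimal, untested sketch that avoids the collision, assuming the related-video IDs are comma-separated inside a single tab-delimited field (the actual data screenshot is not available, so the assumption may not hold):
-- Sketch only: the collection delimiter must differ from the field delimiter,
-- otherwise Hive cannot distinguish array elements from new columns.
-- INTERVAL is a reserved word in newer Hive versions, hence the backticks.
CREATE EXTERNAL TABLE IF NOT EXISTS YOUTUBE_DATA (
  VIDEO_ID STRING,
  UPLOADER STRING,
  `INTERVAL` INT,
  CATEGORY STRING,
  VIDEO_LEN INT,
  VIEW_NO INT,
  RATING FLOAT,
  NO_COMMENTS INT,
  RELATED_VIDEOS ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/five';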

Related

Storing binary with JDBCTemplate

I have a table like:
create table test(payload varbinary(max))
I am trying to store text lines in compressed form in the database using the following code:
String sql = "insert into test(payload) values (compress(:payload))";
MapSqlParameterSource msps = new MapSqlParameterSource();
msps.addValue("payload", "some text", Types.VARBINARY);
NamedParameterJdbcTemplate npjt = //;
npjt.update(sql, msps);
This gives the following error:
String is not in a valid hex format
If I provide the datatype in MapSqlParameterSource as VARCHAR, it doesn't give any error, but then MSSQL's decompress function returns a garbage value:
select decompress(payload) from test
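For what it's worth, this error usually means a Java String was bound to a VARBINARY parameter, so SQL Server tries to parse the string as hex. Here is a sketch of the insert binding the payload as bytes instead (same names as in the snippet above; the class name and the UTF-8 charset choice are assumptions):
import java.nio.charset.StandardCharsets;
import java.sql.Types;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

class PayloadDao {
    void insertCompressed(NamedParameterJdbcTemplate npjt) {
        String sql = "insert into test(payload) values (compress(:payload))";
        MapSqlParameterSource msps = new MapSqlParameterSource();
        // Bind a byte[] rather than a String so the driver sends real binary
        // data instead of expecting a hex-formatted string for VARBINARY.
        msps.addValue("payload", "some text".getBytes(StandardCharsets.UTF_8), Types.VARBINARY);
        npjt.update(sql, msps);
    }
}
On the way back out, DECOMPRESS returns varbinary(max), so cast the result to read it as text: select cast(decompress(payload) as varchar(max)) from test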

Partial loading of data (columns) from a Parquet file into a relational table

I have a table with a VARIANT column that holds the contents of a Parquet file.
I have a relational table of the format:
CREATE TABLE IF NOT EXISTS covid_data_relational
(
    id int identity(1,1),
    date_dt DATE,
    state string,
    value int,
    population_percent float,
    change_from_prior_day int,
    seven_day_change_percent float
);
Inserting data into this table populates only the "date" and "state" columns. No column whose key has the form cases.* is populated.
The query is as follows:
insert into covid_data_relational(date_dt, state, value, population_percent, change_from_prior_day, seven_day_change_percent)
select covid_data_raw:date::date as date_dt,
covid_data_raw:state::string as state,
covid_data_raw:cases.value::int as value,
covid_data_raw:cases.calculated.population_percent::float as population_percent,
covid_data_raw:cases.calculated.change_from_prior_day::int as change_from_prior_day,
covid_data_raw:cases.calculated.seven_day_change_percent::float as seven_day_change_percent
from covid_data_parquet;
Any help is appreciated! Thanks in advance
It would be better to post sample data as text rather than as an image, so people can more easily copy and paste the sample to test. This isn't tested, but it should work.
Take a close look at the cases.calculated.change_from_prior_day key. It is not a nested key under a parent cases path; it is a single hard-coded key containing dots.
To extract it, you'll need to quote the whole key, dots included:
insert into covid_data_relational(date_dt, state, value, population_percent, change_from_prior_day, seven_day_change_percent)
select covid_data_raw:date::date as date_dt,
covid_data_raw:state::string as state,
covid_data_raw:"cases.value"::int as value,
covid_data_raw:"cases.calculated.population_percent"::float as population_percent,
covid_data_raw:"cases.calculated.change_from_prior_day"::int as change_from_prior_day,
covid_data_raw:"cases.calculated.seven_day_change_percent"::float as seven_day_change_percent
from covid_data_parquet;
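If in doubt about which key names actually exist in the VARIANT, a quick way (untested here) to list the top-level keys is to flatten it:
-- Sketch: list top-level keys/values of the VARIANT to confirm whether the
-- Parquet columns arrived as literal dotted names like "cases.value".
select f.key, f.value
from covid_data_parquet,
     lateral flatten(input => covid_data_raw) f
limit 20;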

String delimiter present in string not permitted in Polybase?

I'm creating an external table using a CSV stored in Azure Data Lake Storage and populating the table using Polybase in SQL Server.
However, I ran into the error below. I suspect it is because one particular column contains double quotes within its strings, while the string delimiter has been specified as " in Polybase (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
Example: [screenshot of the data in the original post]
I have done quite extensive research on this and found that the issue has been around for years, yet I have not seen any solution.
Any help will be appreciated.
Because you are in charge of the .csv creation, I think the easiest fix is to use a field delimiter which is not a comma and to leave off the string delimiter entirely. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it is imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO
IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO
CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
--STRING_DELIMITER = '"',
FIRST_ROW = 2,
ENCODING = 'UTF8'
)
);
GO
CREATE EXTERNAL TABLE delimiterWorking (
id INT NOT NULL,
body VARCHAR(8000) NULL
)
WITH (
LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
DATA_SOURCE = ds_azureDataLakeStore,
FILE_FORMAT = ff_delimiterWorking,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
GO
SELECT *
FROM delimiterWorking
GO
-- Fix up the data
CREATE TABLE cleanedData
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
id,
body AS originalCol,
SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody
FROM delimiterWorking
GO
SELECT *
FROM cleanedData
My results: [screenshot in the original post]
The string delimiter issue can be avoided if you convert the data lake flat file to Parquet format.
Input:
"ID"
"NAME"
"COMMENTS"
"1"
"DAVE"
"Hi "I am Dave" from"
"2"
"AARO"
"AARO"
Steps:
1. Convert the flat file to Parquet format (using Azure Data Factory).
2. Create an external file format (assuming a master key and scoped credentials are available):
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create an external table with FILE_FORMAT = PARQUET_CONV.
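A sketch of step 3, reusing the data source name from the earlier example; the table name, column types, and the LOCATION path are assumptions:
-- Untested sketch: an external table over the converted Parquet file.
CREATE EXTERNAL TABLE parquetNoDelimiter (
    id INT,
    name VARCHAR(100),
    comments VARCHAR(8000)
)
WITH (
    LOCATION = 'yourLake/someFolder/delimiterTest6.parquet',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = PARQUET_CONV
);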
Output: [screenshot in the original post]
I believe this is the best option, as Microsoft currently has no solution for handling a string delimiter occurring within the data for external tables.

Loading data to Snowflake target: End of record reached while expected to parse column

Observed error:
End of record reached while expected to parse column '"DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]' File 'DIM_EQUIPMENT_UNIT_issue_nulll_end.csv', line 17, character 133 Row 16, column "DIM_EQUIPMENT_UNIT"["EQUNIT_TARE_WEIGHT_TNE":20]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
For more information on loading options, please run 'info loading_data' in a SQL client.
Sample Data:
3499933,00010101,99991231,"Y","TSXZ 622095",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499948,00010101,99991231,"Y","EOLU 8888370",0,1,28,-1,3,-2,"1","2018-04-26-17.35.47.000000","COMPASS ",5799,-2,-2,"n/a ","n/a",+3.69000000000000E+000
3499968,00010101,99991231,"Y","NONZ 7086849",0,3,-2,-1,5,-2,"1","2017-03-24-17.25.42.000000","COMPASS ",5391,13,-2,"n/a ","n/a",+0.00000000000000E+000
3499992,00010101,99991231,"Y","SGPU 1240279",0,1,31,-1,3,-2,"1","2019-05-22-17.29.11.000000","COMPASS ",6203,-2,-2,"n/a ","n/a",+3.05000000000000E+000
109267,00010101,99991231,"Y","CTSU 425ß85 ",0,1,46,-1,3,-2,"1","2011-05-16-08.52.08.000000","COMPASS ",98,-2,-2,"n/a ","n/a",
DDL:
CREATE OR REPLACE TABLE DIM_EQUIPMENT_UNIT(
EQUNIT_ID NUMBER,
EQUNIT_VLD_FROM DATE,
EQUNIT_VLD_TO DATE,
EQUNIT_VLD_FLAG VARCHAR(1),
EQUNIT_UNIT_NUMBER VARCHAR(13),
EQUNIT_CONSTRUCTION_YEAR NUMBER,
FK_TW2130EQCAT NUMBER,
FK_TW0020EQT NUMBER,
FK_TW2160EQSERIES NUMBER,
FK_TW0050OWS NUMBER,
FK_TW059VEQLESSOR NUMBER,
EQUNIT_CLIENT VARCHAR(1),
EQUNIT_LC TIMESTAMP_NTZ,
EQUNIT_CB VARCHAR(8),
EQUNIT_LOAD_CYCLE_ID NUMBER,
FK_TW0820CHT NUMBER,
FK_TW0850GST NUMBER,
EQUNIT_SAP_ASSET_NUMBER VARCHAR(11),
EQUNIT_PRE_INTEGRATION_OWNER VARCHAR(3),
EQUNIT_TARE_WEIGHT_TNE FLOAT
);
COPY Command Used:
COPY INTO "DIM_EQUIPMENT_UNIT" FROM '@SNOWFLAKE_STAGE/' FILES=('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv') on_error='abort_statement'
file_format=(type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"' ESCAPE_UNENCLOSED_FIELD = None ENCODING ='WINDOWS1252'
EMPTY_FIELD_AS_NULL=true NULL_IF = ('NULL', 'null', '\N') TIMESTAMP_FORMAT='YYYY-MM-DD-HH24.MI.SS.FF')
Look at the last record in your sample data: the 18th and 19th column values are "n/a", and there is nothing at all in column 20. Even if it is supposed to be NULL, something should appear in the data, such as "", "\N", or NULL.
Since there is nothing there, you get the end-of-record error on that column.
Now you can do either of two things:
a. Make sure your file has exactly 20 columns in every row, or
b. If you cannot do that, and you are OK with ignoring the row, change ON_ERROR in the COPY statement to on_error='continue'.
This will skip the offending row and move on, as in the sketch below.
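For option (b), the same COPY with the relaxed error handling would look like this (untested sketch; everything except ON_ERROR is unchanged from the statement above):
COPY INTO "DIM_EQUIPMENT_UNIT"
FROM '@SNOWFLAKE_STAGE/'
FILES = ('DIM_EQUIPMENT_UNIT_issue_nulll_end.csv')
ON_ERROR = 'continue'
file_format = (type=csv SKIP_HEADER=1 FIELD_OPTIONALLY_ENCLOSED_BY='"'
  ESCAPE_UNENCLOSED_FIELD = None ENCODING = 'WINDOWS1252'
  EMPTY_FIELD_AS_NULL = true NULL_IF = ('NULL', 'null', '\N')
  TIMESTAMP_FORMAT = 'YYYY-MM-DD-HH24.MI.SS.FF');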

Unable to insert data into Hive with a custom delimiter

I am trying to learn Hive; this may be a stupid question, but:
I created a table in Hive as follows:
create table if not exists tweets_table(
tweetdata STRING,
followerscount INT,
friendscount INT,
statuscount INT,
retweetcount INT,
favouritescount INT,
lang STRING,
placefullname STRING,
placename STRING,
countryname STRING,
countrycode STRING,
hashtags STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^**^'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/TestDB.txt' INTO TABLE tweets_table;
I have used '^**^' to delimit the text because the tweets contain a lot of \n, \r, and , characters (please suggest a standard practice if there is one).
So I have a text file which I am trying to load:
09-09-2016 10:51:33|^**^|#ArvindKejriwal #abpnewstv तुम्हारे दावों का क्या हुआ केजरीवाल।|^**^|74|^**^|30|^**^|0|^**^|98|^**^|0|^**^|49|^**^|en|^**^|Ambikapur, India|^**^|Ambikapur|^**^|India|^**^|IN|^**^|[]
09-09-2016 10:51:37|^**^|#LiveLawIndia It is shocking a judge arrested. I am sure Higher Judiciary will come their rescue , Judges per se cannot be wrong|^**^|0|^**^|14|^**^|0|^**^|32|^**^|0|^**^|2|^**^|en|^**^|Rajasthan, India|^**^|Rajasthan|^**^|India|^**^|IN|^**^|[]
After successfully loading and querying it, I get the following output:
09-09-2016 10:51:33| NULL NULL NULL NULL NULL |30| ** |0| ** |98| **
09-09-2016 10:51:37| NULL NULL NULL NULL NULL |14| ** |0| ** |32| **
I fail to understand where I am going wrong: is it my text file or the Hive table definition? Please help.
There are several issues with what you're trying to do:
With FIELDS TERMINATED BY you can't have a delimiter that is longer than one character.
Even if that worked, it wouldn't solve the problem that your tweets contain line delimiters: each \n in a tweet starts a new row.
As described, the table is impossible to parse correctly, because \n serves both as the line delimiter and as data inside the tweets. If you are the one generating this input file, I would suggest replacing all \n and \r in the tweets with spaces.
Create the table with the regex SerDe instead of the default Hive SerDe.
Modify the regex below based on the number of columns:
^(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)$
Table:
create external table if not exists tweets_table(
tweetdata STRING,
followerscount INT,
friendscount INT,
statuscount INT,
retweetcount INT,
favouritescount INT,
lang STRING,
placefullname STRING,
placename STRING,
countryname STRING,
countrycode STRING,
hashtags STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = " ^(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)\|\^\*\*\^\|(.+)$",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s"
)
STORED AS TEXTFILE;
Load data:
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/TestDB.txt' INTO TABLE tweets_table ;
If you get a RegexSerDe ClassNotFoundException, add the regex SerDe jar:
ADD JAR hive-contrib-x.x.x.jar;
