Modify the delimiter of an external table with HiveQL

I'm taking a CSV file from HDFS and loading it into an external table in Hive.
But my CSV file uses " ; " as the delimiter, and my second column also contains " ; " within its data.
Can you guide me on what I should do? Are there any Hive properties that allow me to handle this, or any other solution?

By default, ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' will split the value apart.
If you want the (OS) value to be part of the second column, you need to quote that column, e.g. A;"Mozilla/5.0;(Linux)";BR. In other words, change how the file is written/stored outside of Hive.
If you cannot modify the file, you can have your queries concatenate those two columns, e.g. SELECT CONCAT(user_agent, ';', os) FROM data;
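The effect of quoting can be sketched with Python's csv module (used here only as a reference for quote-aware parsing; the sample record is illustrative): a quoted field keeps the embedded ';' intact, while a naive split does not.

```python
import csv
import io

# One record whose second column contains the ';' delimiter, protected by quotes.
line = 'A;"Mozilla/5.0;(Linux)";BR\n'

# A naive split on ';' breaks the middle column into two pieces.
naive = line.strip().split(';')
print(naive)   # 4 pieces

# A quote-aware reader keeps it as one column.
row = next(csv.reader(io.StringIO(line), delimiter=';', quotechar='"'))
print(row)     # ['A', 'Mozilla/5.0;(Linux)', 'BR']
```

This is the same behavior a CSV-aware SerDe performs inside Hive: the quote character suspends the delimiter until the matching closing quote.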

Related

ADF copy activity unable to identify the double quotes (") in between a column value while loading CSV files to Snowflake

I'm facing an issue with the ADF copy activity while loading CSV data into a Snowflake table.
While loading the CSV file into the Snowflake table using the ADF copy activity, it treats the data of a single column as data for multiple columns.
For example: "My brother often watches different cricket shows on different ""screens"", but on the same different platform"
This is the value of the single column column_A, but the ADF copy activity reads it as values for two columns instead of one, i.e.
col_A = My brother often watches different cricket shows on different ""screens"
col_B = but on the same different platform
But I want this value to be in a single column, i.e.
column_A = "My brother often watches different cricket shows on different ""screens"", but on the same different platform"
Are there any alternatives I could use for this?
In your source data, the column value contains a comma (,) and double quotes ("), which are the same as your dataset's column delimiter and quote character properties.
The column delimiter separates the columns based on the given delimiter value.
If a column value also contains the delimiter character, the quote character is used to identify the complete value as a single column.
Example:
sample data : "1,abc",def
In your case the column value contains both the column delimiter and the quote character, so it is not identified as a single column but is instead split into 2 columns based on the dataset property values (comma , and double quote ").
Your sample data :
"My brother often watches different cricket shows on different ""screens"", but on the same different platform"
To fix this you can change the column delimiter in your source file, or replace the double quotes within the column value with something else.
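The doubled-quote escaping that this record uses can be checked with Python's csv module (a stand-in reference parser, not ADF itself): with a comma delimiter and '"' as both quote and escape character, the sample value comes back as one column.

```python
import csv
import io

# The problematic record: an embedded comma and doubled ("") quotes
# inside one quoted field.
line = ('"My brother often watches different cricket shows on different '
        '""screens"", but on the same different platform"\n')

# The default dialect treats "" inside a quoted field as a literal quote.
row = next(csv.reader(io.StringIO(line)))
print(len(row))   # 1 -> parsed as a single column
print(row[0])
```

So a parser configured with '"' as the escape character reads the value correctly; the problem arises when the dataset properties do not match this escaping scheme.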

Changing .csv delimiter on ADF

I am trying to load a .csv table to MS SQL Server via Azure Data Factory, but I have a problem with the delimiter (;) since it appears as a character in some of the values included in some columns.
As a result, I get an error saying in the details "found more columns than expected column count".
Is there any way to change the delimiter directly on ADF before/while loading the .csv table (ex.: making it from ";" to "|||")?
Thanks in advance!
I have a problem with the delimiter (;) since it appears as a character in some of the values included in some columns.
As you have quoted, your delimiter is ; but it also occurs as a character within some of the columns, which means there is no specific pattern to the occurrence. Hence, it is not possible in ADF.
The recommendation is to write a program in any preferred language (like Python) which iterates over each row of the dataset and applies logic to replace the delimiter with ||| (or to remove the unrequired ; characters), appending the changes to a new file. You can then ingest this new file in ADF.
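A minimal sketch of that preprocessing program, under the assumption that rows normally have a known column count and that any surplus ';' belongs to the last column (the sample rows and the merge rule are illustrative; adjust them to your data):

```python
# Expected number of columns per row (an assumption for this sketch).
EXPECTED_COLS = 3

def convert_line(line: str) -> str:
    parts = line.rstrip("\n").split(";")
    if len(parts) > EXPECTED_COLS:
        # Surplus ';' found: glue the tail back into the last column.
        parts = parts[:EXPECTED_COLS - 1] + [";".join(parts[EXPECTED_COLS - 1:])]
    return "|||".join(parts)

rows = ["a;b;c", "a;b;c;with stray ; inside"]
converted = [convert_line(r) for r in rows]
print(converted)
```

In a real job you would read the source file line by line, write the converted rows to a new file, and point ADF at that file with ||| as the column delimiter.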

Load csv file data into tables

Created tables as below :
source:([id:`symbol$()] ric:();source:();Date:`datetime$())
property:([id:`symbol$()] Value:())
Then I have two .csv files containing the data for the two tables.
property.csv looks like this:
id,Value
TEST1,1
TEST2,2
source.csv looks like this:
id,ric,source,Date
1,TRST,QO,2017-07-07 11:42:30.603
2,TRST2,QOT,2018-07-07 11:42:30.603
Now, how do I load the csv file data into each table in one go?
You can use 0: to load delimited records. https://code.kx.com/wiki/Reference/ZeroColon
The simplest form of the function is (types; delimiter) 0: filehandle
The types should be given as their uppercase letter representations, one for each column, or a blank space to ignore a column. E.g. using "SJ" for property.csv would mean reading the id column as a symbol and the Value column as a long.
The delimiter specifies how the columns are separated; in your case, comma-separated values (CSV). If you pass the delimiter as a plain string ",", every row is treated as part of the data and a nested list of the columns is returned, which you can either insert into a table with a matching schema, or append headers onto and flip to get a table, like so: flip `id`value!("IS";",") 0: `:test.txt.
If the first row of the csv contains the column headers, you can pass an enlisted delimiter, enlist ",", which will use those headers and return a table in kdb with them as the column names, which you can then rename if you see fit.
As the files you want to read in have different column types and are to be loaded into different tables, you could create a function to read them in, for example:
{x insert (y;enlist ",") 0:z}'[(`source;`property);("SSSP";"SJ");(`:source.csv;`:property.csv)]
This allows you to specify the name of the table to insert into, the column types, and the file handle of the file.
I would suggest a timestamp instead of the (deprecated) datetime, as it is stored as a long instead of a float, so there will be no issues with comparison.
You can use key to list the contents of the dir, filter the csv files, create a type mapping for each file, and finally load each one:
files: key `:.; /get the contents of the dir
files: files where files like "*.csv"; /filter the csv files
m: `property.csv`source.csv!("SJ";"JSSZ"); /create the type mappings for each csv file
{[f] .[first ` vs f;();:; (m f;enlist csv) 0: hsym f]} each files /load each csv file into a table named after it
Please note that here the directory is the current working directory ('pwd'); you might need to prepend the directory path to each file before using 0:.

Import csv to SQLServer when there are spaces after the text qualifier

I have a csv file with a column GeoCodes. This uses " as text qualifier.
I am trying to import this into SQLServer using the SQL Server Import Wizard.
The problem with the data is that if there is no GeoCode, the csv file will sometimes output the GeoCode as " " followed by several spaces. This causes an error when importing, as the import picks up the data within the text qualifier and then finds these spaces before the next comma delimiter.
An example of the data below. The Pontypandy row is the row that errors.
Place ,Geo Codes ,Type
Northpole ,"90.0000,0.0000 ",Pole
Southpole ,"-90.0000,0.0000 ",Pole
Pyramids ,"29.9765,31.1313 ",BigTriangle
France ," ",Country
Pontypandy ," " ,City
I have to use the text qualifiers as there is a comma in the GeoCodes.
I have no say on how the data is sent to me and therefore have to deal with the data as is.
As a workaround I currently do a find and replace on the data in Notepad before importing. This adds an extra step to the job that hopefully isn't needed.
Is there any way I can get around the " " spaces during the import?
As an extra note, I don't currently have access to SSIS but if it can be done in there any easier then please answer with that as it could help me justify getting SSIS (I might have to remove this comment later if I have to show it to my manager).
If your data really is the way you show above, you can use a fixed-width format: import the data as is and strip the " characters afterwards. This is not the best solution, though.
Much better: pipe the import file through sed before importing. This is not only much faster, but when the data is larger than your RAM it is the only easy way in all cases (OK, there are some others). All you need is sed at the operating-system level; if you can copy the executable somewhere, that is all you need. If you want to replace " followed by any number of blanks and a comma with ",, this is the command:
sed -b -e "s/\" *,/\",/" myfile.txt > yournewfile.txt
The regex is easy once you get the idea:
- s means substitute,
- s/first/second/ means look for first and replace it with second,
- \" is the escaped " (needed because of the DOS shell quoting),
- a space followed by * matches any number of spaces,
- , matches a literal comma.
On a lot of systems sed is available (e.g. via cygwin). Have fun!
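For completeness, the same cleanup can be sketched in Python when sed is not available (the sample rows mirror the data above; the regex is a direct translation of the sed expression):

```python
import re

# Collapse a closing text qualifier followed by trailing spaces and a
# comma ('" ... ,') down to '",', mirroring s/\" *,/\",/ in sed.
pattern = re.compile(r'" *,')

def clean(line: str) -> str:
    return pattern.sub('",', line)

print(clean('Pontypandy ," "      ,City'))           # trailing spaces removed
print(clean('Northpole ,"90.0000,0.0000 ",Pole'))    # well-formed row unchanged
```

Run over the file line by line, this produces the same output as the sed pipeline.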
Two methods of Bulk Insert
Row-based Bulk Insert
Most useful when you have string-qualified columns in the CSV.
You will first need to create a table with two fields, an identity and a varchar(max); the identity signifies the row count and the varchar(max) holds the row data.
Create a view that selects only the varchar(max) field from the table above.
The Bulk Insert syntax will look something like this:
BULK INSERT AdventureWorks2012.Sales.v_SalesOrderDetail
FROM 'f:\orders\lineitem.csv'
WITH (
ROWTERMINATOR =' |\n'
);
Columnar-based Insert:
This is the most widely used approach, but it is only useful and reliable when there are no string-qualified columns.
Use the common Bulk Insert syntax with the FIELDTERMINATOR and ROWTERMINATOR options.
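The row-based idea can be sketched outside SQL Server as well: stage each physical line as a single opaque string first, then split it with a quote-aware parser in a second pass (Python is used here purely as an illustration of the two phases; the sample data is invented):

```python
import csv
import io

# Phase 1: each physical line lands as one "row data" value, just as the
# identity + varchar(max) staging table would hold it.
staged = [
    (1, '"Smith, J",42,Cardiff'),
    (2, '"Jones, M",76,Swansea'),
]

# Phase 2: split each staged row with a quote-aware parser, which keeps
# the comma inside "Smith, J" from creating an extra column.
parsed = [next(csv.reader(io.StringIO(raw))) for _, raw in staged]
print(parsed[0])   # ['Smith, J', '42', 'Cardiff']
```

In SQL Server the second phase would be done in T-SQL against the staging table rather than in Python, but the separation of concerns is the same.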
References:
Bulk-Insert Syntax: https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql#examples
Bulk-Insert with View: https://technet.microsoft.com/en-us/library/ms179250(v=sql.105).aspx
Bulk-Insert with Table: https://technet.microsoft.com/en-us/library/ms187086(v=sql.105).aspx

Import CSV data into SQL Server

I have data in the csv file similar to this:
Name,Age,Location,Score
"Bob, B",34,Boston,0
"Mike, M",76,Miami,678
"Rachel, R",17,Richmond,"1,234"
While trying to BULK INSERT this data into a SQL Server table, I encountered two problems:
1. If I use FIELDTERMINATOR=',' then it splits the first (and sometimes the last) column.
2. The last column is an integer column, but it has quotes and a comma thousands separator whenever the number is greater than 1,000.
Is there a way to import this data (using an XML format file or whatever) without manually parsing the csv file first?
I appreciate any help. Thanks.
You can parse the file with http://filehelpers.sourceforge.net/
With that result, use the approach described in SQL Bulkcopy YYYYMMDD problem, or feed it straight into SqlBulkCopy.
Use MySQL load data:
LOAD DATA LOCAL INFILE 'path-to-/filename.csv' INTO TABLE `sql_tablename`
CHARACTER SET 'utf8'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES;
The OPTIONALLY ENCLOSED BY '\"' part (the quote/escape character) keeps the quoted data in the first column together as a single field.
IGNORE 1 LINES leaves out the header row.
The CHARACTER SET 'utf8' line is optional, but good to use if names contain diacritics, as in José.
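If preprocessing outside the database is acceptable after all, both problems (the quoted comma in the name and the thousands separator in the score) can be handled with a quote-aware parser; Python's csv module is used here as a sketch, and converting "1,234" to the integer 1234 is an assumption about the intended result:

```python
import csv
import io

data = '''Name,Age,Location,Score
"Bob, B",34,Boston,0
"Mike, M",76,Miami,678
"Rachel, R",17,Richmond,"1,234"
'''

reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row
# Quote-aware split keeps "Bob, B" whole; stripping the comma from the
# quoted score lets it convert cleanly to an integer.
rows = [(name, int(age), loc, int(score.replace(",", "")))
        for name, age, loc, score in reader]
print(rows[2])   # ('Rachel, R', 17, 'Richmond', 1234)
```

The cleaned rows could then be written back out as a plain CSV (or passed to SqlBulkCopy) without any quoting ambiguity.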
