I am trying to load a .csv file into MS SQL Server via Azure Data Factory, but I have a problem with the delimiter (;), since it also appears as a character in some of the values in some columns.
As a result, I get an error whose details say "found more columns than expected column count".
Is there any way to change the delimiter directly in ADF before/while loading the .csv file (e.g., changing it from ";" to "|||")?
Thanks in advance!
I have a problem with the delimiter (;) since it appears as a character in some of the values included in some columns.
As you have described, your delimiter is ;, but it also occurs as a character inside some of the column values, which means there is no specific pattern to the occurrences. Hence, this is not possible in ADF.
The recommendation is to write a program in any preferred language (such as Python) that iterates over each row of the dataset and either replaces the delimiter with ||| or removes the unwanted ; characters, writing the changes to a new file. You can then ingest this new file in ADF; a rough sketch of the idea is shown below.
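A minimal sketch of such a pre-processing script, assuming the header row gives the expected column count and that any extra ; occurrences belong to the last column (the file names input.csv and output.csv are hypothetical):

expected_cols = None

with open("input.csv", encoding="utf-8") as src, \
     open("output.csv", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.rstrip("\r\n").split(";")
        if expected_cols is None:
            expected_cols = len(parts)  # learn the width from the header row
        if len(parts) > expected_cols:
            # glue the overflow back into the last column
            parts = parts[:expected_cols - 1] + [";".join(parts[expected_cols - 1:])]
        dst.write("|||".join(parts) + "\n")

Adjust the "overflow" rule to wherever the stray ; actually occurs in your data; the point is simply that the rewrite happens before ADF ever sees the file.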
I'm taking a CSV file from HDFS and loading it into my external table in Hive.
But my CSV file uses " ; " as the delimiter, and my second column contains " ; " as part of its data.
For example: A;Mozilla//5.0;(Linux);BR
Can you guide me on what I should do? Are there any Hive properties that allow me to handle this, or any other solution?
By default, ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' will split that value apart.
If you want the (OS) value to be part of the second column, you need to quote that column, e.g. A;"Mozilla//5.0;(Linux)";BR. In other words, change how the file is written/stored outside of Hive.
If you cannot modify the file, you can make your queries simply concatenate those two columns, e.g. SELECT CONCAT(user_agent, ';', os) FROM data;
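If you go the route of rewriting the file outside of Hive, a minimal Python sketch (assuming the stray ; only ever splits the second column, and with hypothetical file names) could look like this:

with open("access.csv", encoding="utf-8") as src, \
     open("access_quoted.csv", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.rstrip("\r\n").split(";")
        first, last = parts[0], parts[-1]
        middle = ";".join(parts[1:-1])  # everything in between belongs to column 2
        dst.write(f'{first};"{middle}";{last}\n')

This turns A;Mozilla//5.0;(Linux);BR into A;"Mozilla//5.0;(Linux)";BR, which a quote-aware reader (e.g. Hive's OpenCSVSerde) can then treat as three fields.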
Requirement:
I need the file to be exported in the format below, where gender, age, and interest are columns and the value after : is the data for that column. Can this be achieved using Snowflake? If not, is it possible to export the data using Python?
User1234^gender:male;age:18-24;interest:fishing
User2345^gender:female
User3456^age:35-44
User4567^gender:male;interest:fishing,boating
EDIT 1: The solution as given by @demircioglu displays NULL values instead of the other column values.
Here is what I get with the EMPLOYEES table data when I run the query below:
SELECT 'EMP_ID'||EMP_ID||'^'||'FIRST_NAME'||':'||FIRST_NAME||';'||'LAST_NAME'||':'||LAST_NAME FROM tempdw.EMPLOYEES ;
Create your SQL with the desired format and write it to a file
COPY INTO @~/stage_data
FROM
(
SELECT 'User'||User||'^'||'gender'||':'||gender||';'||'age'||':'||age||';'||'interest'||':'||interest FROM table
)
file_format = (TYPE=CSV compression='gzip')
The file format here is not important, because each line will be treated as a single field due to your delimiter requirements.
Edit:
The CONCAT function (aliased as ||) returns NULL if any of its inputs is NULL.
To eliminate NULLs you can use the NVL2 function.
So your SQL will have a series of NVL2s.
NVL2 checks its first parameter: if it is not NULL it returns the first expression, and if it is NULL it returns the second expression.
So for the User column,
'User'||User||'^' will turn into
NVL2(User,'User','')||NVL2(User,User,'')||NVL2(User,'^','')
P.S. I am leaving it up to you to create the rest of the SQL, because Stack Overflow's function is to help find the solution, not to spoon-feed it.
No, I do not believe multiple different delimiters like this are supported in Snowflake at this time. Multi-byte and multi-character delimiters are supported, but a single, consistent delimiter has to be specified for the field and for the record/line separator.
Yes, it may be possible to do some post-processing or use Python scripts to achieve this, or even transformative SQL statements. This is not really my area of expertise, so if someone has an example for you, I'll let them add to the discussion.
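One hedged illustration of that post-processing idea, written in Python: it assumes the data has already been exported to a plain CSV (export.csv) with hypothetical columns user_id, gender, age, and interest, and it skips blank attributes so missing values simply drop out of the line:

import csv

with open("export.csv", newline="", encoding="utf-8") as src, \
     open("formatted.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        # keep only the attributes that actually have a value
        attrs = [f"{key}:{row[key]}" for key in ("gender", "age", "interest") if row.get(key)]
        dst.write(f"User{row['user_id']}^" + ";".join(attrs) + "\n")

For the sample data this would produce lines like User1234^gender:male;age:18-24;interest:fishing and User3456^age:35-44.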
I am trying to load data from an Excel .csv file into a flat file format to use as a data source in a Data Services job data flow, which then transfers the data to a SQL Server (2012) database table.
I consistently lose 1 in 6 records.
I have tried various parameter values in the file format definition and settled on setting Adaptable file scheme to "Yes", file type "delimited", column delimiter "comma", row delimiter {windows new line}, text delimiter ", language eng(English), and all else as defaults.
I have also set "write errors to file" to "yes", but it just creates an empty error file (I expected the 6,000-odd unloaded rows to be in there).
If we strip out three of the columns containing special characters (visible in Excel), it loads fine, so I think these characters are the problem.
The thing is, we need the data in those columns, and unfortunately this .csv file is as good a data source as we are likely to get; it is always likely to contain special characters in these three columns, so we need to be able to read it in if possible.
Should I try to specifically strip the columns in the Query source component of the dataflow? Am I missing a data-cleansing trick in the query or file format definition?
OK, so I didn't get the answer I was looking for, but I did get it to work by setting the "Row within Text String" parameter to "Row delimiter".
I have data in the csv file similar to this:
Name,Age,Location,Score
"Bob, B",34,Boston,0
"Mike, M",76,Miami,678
"Rachel, R",17,Richmond,"1,234"
While trying to BULK INSERT this data into a SQL Server table, I encountered two problems.
If I use FIELDTERMINATOR=',' then it splits the first (and sometimes the last) column.
The last column is an integer column, but it has quotes and a comma thousands separator whenever the number is greater than 1,000.
Is there a way to import this data (using XML Format File or whatever) without manually parsing the csv file first?
I appreciate any help. Thanks.
You can parse the file with http://filehelpers.sourceforge.net/
And with that result, use the approach described in "SQL Bulkcopy YYYYMMDD problem", or feed it straight into SqlBulkCopy.
Use MySQL's LOAD DATA:
LOAD DATA LOCAL INFILE 'path-to-/filename.csv' INTO TABLE `sql_tablename`
CHARACTER SET 'utf8'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES;
The OPTIONALLY ENCLOSED BY '\"' part (the escape character plus the quote) will keep the quoted data in the first column together as a single field.
IGNORE 1 LINES leaves out the header row with the field names.
The UTF-8 line is optional, but good to use if names have diacritics, as in José.
I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I selected to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't; I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they contain only one new line; thanks, Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines: it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last element of one line and the first element of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
// Wrap any unquoted field that contains a newline in double quotes.
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

// One logical row: five fields, each either quoted or comma-free.
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a programmatic Java solution, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.
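If you would rather repair the file programmatically in Python instead, here is a rough sketch using the standard csv module. It assumes you know the expected column count (5 for the sample above) and uses hypothetical broken.csv / fixed.csv file names; csv.reader already handles the properly quoted multi-line fields, so only the rows broken by an unquoted newline need to be stitched back together:

import csv

EXPECTED_COLS = 5  # known ahead of time, as with the regex approach

with open("broken.csv", newline="", encoding="utf-8") as src, \
     open("fixed.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    pending = []
    for row in csv.reader(src):
        if not row:
            continue  # skip blank physical lines
        if pending:
            # re-join the field that the stray newline broke in two
            pending[-1] += "\n" + row[0]
            pending.extend(row[1:])
            row = pending
        if len(row) < EXPECTED_COLS:
            pending = row  # still incomplete, wait for the next physical line
        else:
            writer.writerow(row)
            pending = []

The output quotes every field (QUOTE_ALL), so Excel will keep the embedded newlines inside a single cell.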