I have a pretty simple CSV file (separator is ;, Notepad++ reports CR LF line endings and UCS-2 Little Endian encoding) that I need to import into SQL Server daily.
I have now spent a day trying to make BULK INSERT or OPENROWSET work, but I keep failing.
BULK INSERT tt_MaterialSAPx
FROM '\\192.168.89.22\LandingZone\MAT.csv'
WITH (
    FIRSTROW = 2,
    LASTROW = 5,
    --DATAFILETYPE = 'native',
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\r'
    --,FORMATFILE = '\\192.168.89.22\LandingZone\formatfile.fmt'
)
The BULK INSERT code above (it doesn't import at all if I set the row terminator to \r\n) doesn't read the characters correctly: there is basically a "blank" between each character, which looks like UTF-16 vs UTF-8. Furthermore, it fails if I take out LASTROW = 5 with "Bulk load: An unexpected end of file was encountered in the data file."
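For reference, the variant the documentation points to for a UTF-16 (UCS-2) file uses DATAFILETYPE = 'widechar'; this is only a sketch of what I understand should be needed, not something I have working yet:
BULK INSERT tt_MaterialSAPx
FROM '\\192.168.89.22\LandingZone\MAT.csv'
WITH (
    FIRSTROW = 2,
    DATAFILETYPE = 'widechar',  -- treat the file as UTF-16 / UCS-2 Little Endian
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '\n'        -- '\n' also matches CR LF line endings in BULK INSERT
);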
The OPENROWSET attempt:
SELECT * FROM OPENROWSET(
    BULK '\\192.168.89.22\LandingZone\MAT8.csv',
    SINGLE_CLOB
    --,FORMATFILE = '\\192.168.89.22\LandingZone\formatfile.fmt'
) AS DATA
This sticks all the data into the first column of the first row, so it ignores the field and row terminators.
Utterly frustrated: what am I missing? Or is there a third way to solve this?
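From the documentation, SINGLE_CLOB returns the whole file as a single one-row, one-column value by design, so getting rows and fields out of OPENROWSET(BULK ...) seems to require a format file instead; roughly this shape, with the format file itself being the part I have not worked out:
SELECT *
FROM OPENROWSET(
    BULK '\\192.168.89.22\LandingZone\MAT8.csv',
    FORMATFILE = '\\192.168.89.22\LandingZone\formatfile.fmt'
) AS DATA;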
Related
I'm creating an external table using a CSV stored in Azure Data Lake Storage and populating the table using PolyBase in SQL Server.
However, I ran into the problem below, and I suspect it is because one particular column contains double quotes within the string, while the string delimiter in PolyBase has been specified as " (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
Example:
I have done quite extensive research on this and found that the issue has been around for years, but I have yet to see a solution.
Any help will be appreciated.
I think the easiest way to fix this, since you are in charge of creating the .csv, is to use a delimiter that is not a comma and leave off the string delimiter. Use a separator that you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it has been imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
DROP EXTERNAL TABLE delimiterWorking
GO
IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
DROP TABLE cleanedData
GO
IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO
CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        --STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE delimiterWorking (
    id INT NOT NULL,
    body VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_delimiterWorking,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
GO
SELECT *
FROM delimiterWorking
GO
-- Fix up the data
CREATE TABLE cleanedData
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
    id,
    body AS originalCol,
    SUBSTRING(body, 2, LEN(body) - 2) AS cleanBody
FROM delimiterWorking
GO
SELECT *
FROM cleanedData
My results:
The string delimiter issue can be avoided by converting the Data Lake flat file to Parquet format.
Input:
"ID","NAME","COMMENTS"
"1","DAVE","Hi "I am Dave" from"
"2","AARO","AARO"
Steps:
1. Convert the flat file to Parquet format [using Azure Data Factory]
2. Create the external file format [assuming a master key and scoped credentials are available]
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create the external table with FILE_FORMAT = PARQUET_CONV (a sketch follows below)
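A minimal sketch of step 3, reusing the data source name from the earlier answer; the table name, column definitions, and Parquet folder path are placeholders, not the real ones:
CREATE EXTERNAL TABLE parquetConverted (
    ID INT NULL,
    NAME VARCHAR(100) NULL,
    COMMENTS VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/convertedParquet/',
    DATA_SOURCE = ds_azureDataLakeStore,  -- assumed external data source
    FILE_FORMAT = PARQUET_CONV
);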
Output:
I believe this is the best option, as Microsoft currently doesn't have a solution for handling a string delimiter occurring within the data for external tables.
I'm pulling data from a SQL Server table using pyodbc Python code.
In the output file I’m getting records like this:
1, 1, None, None, None, None, None, None
The None values are NULL in the SQL table.
I'd like to see records in the text file in the following format instead; I do not want to see the None:
1, 1, , , , , ,
Any ideas how I can do this?
Here is the code I'm using:
import pyodbc

outputfile = 'MyOut.txt'
output_data = open(outputfile, 'w+')

conn = pyodbc.connect(
    r'Driver={SQL Server};'
    r'Server=MyServer;'
    r'Database=MyData;'
    r'Trusted_Connection=yes;')

crsr = conn.cursor()
crsr.execute('select * from MyTable')

for row in crsr:
    print(str(row))
    outrows = str(row).strip('(')
    outrows = outrows.strip(')')
    output_data.write(outrows + '\n')

output_data.close()
I understand that outrows is a string, but this would probably be easier with a list. Aside from that, the output is probably meant to stay a string, since you're writing it to a .txt file.
You could modify your for loop as follows:
for row in crsr:
    outrows = str(row).strip("(").strip(")")
    # create a list by splitting the string at each comma
    line = outrows.split(",")
    for i, component in enumerate(line):
        # compare against " None" because there is most likely a space after the ","
        if component == " None":
            line[i] = ""
    output_data.write(",".join(line) + "\n")
I'm afraid I'm not particularly familiar with pyodbc, but I hope this was of help.
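Alternatively, since the None values start out as NULLs in SQL Server, they could be blanked out in the query itself so that Python never sees them. A rough sketch using ISNULL; the column names and types are made up, as I don't know the real table definition:
SELECT
    Col1,
    Col2,
    ISNULL(CAST(Col3 AS VARCHAR(50)), '') AS Col3  -- NULL becomes an empty string
FROM MyTable;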
I've got some data in a string column that is in a strange CSV format. I can write a file format that correctly interprets it. How do I use my file format against data that has already been imported?
create table test_table
(
my_csv_column string
)
How do I split/flatten this column with:
create or replace file format my_csv_file_format
type = 'CSV'
RECORD_DELIMITER = '0x0A'
field_delimiter = ' '
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
VALIDATE_UTF8 = FALSE
Please assume that I cannot use split, as I want to use the rich functionality of the file format (optional escape characters, date recognition, etc.).
What I'm trying to achieve is something like the below (but I cannot find out how to do it):
copy into destination_Table
from
(select
s.$1
,s.$2
,s.$3
,s.$4
from test_table s
file_format = (column_name ='my_csv_column' , format_name = 'my_csv_file_format'))
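For reference, this is the stage-based form where the file format already does what I want, and which I am effectively trying to reproduce against an already-loaded column (the stage name here is hypothetical):
select t.$1, t.$2, t.$3, t.$4
from @my_stage (file_format => 'my_csv_file_format') t;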
Let's say I have the following dataset imported from a text file:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill "The man" Taylor","France"
I am looking for a way to remove every " in the data, unless it's followed or preceded by a ,.
So in my case, the data should become:
Data
--------------------
1,"John Davis","Germany"
2,"Mike Johnson","Texas, USA"
3,"Bill The man Taylor","France"
I tried it with the import text file component in SSIS, but that gives an error when I set the column delimiter to ". If I don't set a delimiter, it sees the comma in "Texas, USA" as a split delimiter.
Any suggestions/ideas? The text file is too large to change this manually for every line, so that's not an option.
Bit of a cop-out on the last '"', but:
Create table #test ([Data] nvarchar(max))
insert into #test values ('1,"John Davis","Germany"' )
insert into #test values ('2,"Mike Johnson","Texas, USA"' )
insert into #test values ('3,"Bill "The man" Taylor","France"')
select replace(replace(replace(replace([Data],',"',',~'), '",','~,'),'"', ''),'~','"') + '"'
from #test
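To make the placeholder trick easier to follow, this is how the third row moves through the nested REPLACEs (the tilde is just a stand-in character that is assumed not to occur in the data):
3,"Bill "The man" Taylor","France"    -- original value
3,~Bill "The man" Taylor~,~France"    -- after ',"' -> ',~' and '",' -> '~,'
3,~Bill The man Taylor~,~France       -- after stripping the remaining '"'
3,"Bill The man Taylor","France"      -- after '~' -> '"' and re-appending the trailing '"'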
I am facing a problem when loading data into SQL Server using the bulk loader (SQL loader). I generate the file through Java code and write a line terminator at the end of each line.
My code is:
BULK INSERT ExcelFile
FROM 'D:\201606162171840OwnersImportFormat.csv'
WITH
(
    -- FIRSTROW = 2,
    FIELDTERMINATOR = ',',  -- CSV field delimiter
    ROWTERMINATOR = '\n',
    FIRSTROW = 1            -- start loading from the first row of the file
    -- ERRORFILE = 'D:\SchoolsErrorRows.csv',
    -- TABLOCK
);
I am getting an error like: Bulk load data conversion error (truncation) for row 1, column 56 (File_upload).
Please tell me where I am going wrong.
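A truncation error on a specific column usually means the value in the file is longer than the target column, or that rows are being split in the wrong place so too much text lands in one field. Two things that may help narrow it down, sketched with assumed sizes since the real table definition isn't shown here:
-- capture the rejected rows by enabling the ERRORFILE option commented out above:
-- ERRORFILE = 'D:\SchoolsErrorRows.csv',

-- or widen the column named in the message (NVARCHAR(MAX) is just a placeholder size):
ALTER TABLE ExcelFile ALTER COLUMN File_upload NVARCHAR(MAX);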