Polybase CREATE EXTERNAL TABLE skip header - sql-server

I am new to Azure and PolyBase, and I am trying to read a CSV file into a SQL external table.
On some forums I read that it is not possible to skip the first row (the header).
I'm hoping that's wrong. Can you help me?
The code I used is below.
Thanks in advance
CREATE EXTERNAL TABLE dbo.Test2External (
[Guid] [varchar](36) NULL,
[Year] [smallint] NULL,
[SysNum] [bigint] NULL,
[Crc_1] [decimal](15, 2) NULL,
[Crc_2] [decimal](15, 2) NULL,
[Crc_3] [decimal](15, 2) NULL,
[Crc_4] [decimal](15, 2) NULL,
[CreDate] [date] NULL,
[CreTime] [datetime] NULL,
[UpdDate] [date] NULL,
...
)
WITH (
LOCATION='/20160823/1145/FIN/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile
);
-- Run a query on the external table
SELECT count(*) FROM dbo.Test2External;

There is a workaround: use an EXTERNAL FILE FORMAT with FIRST_ROW = 2.
e.g. if we create a file format
CREATE EXTERNAL FILE FORMAT [CsvFormatWithHeader] WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
FIRST_ROW = 2,
STRING_DELIMITER = '"',
USE_TYPE_DEFAULT = False
)
)
GO
And then use this file format with CREATE EXTERNAL TABLE:
CREATE EXTERNAL TABLE [testdata].[testfile1]
(
[column1] [nvarchar](4000) NULL
)
WITH (
DATA_SOURCE = data_source,
LOCATION = file_location,
FILE_FORMAT = [CsvFormatWithHeader],
REJECT_TYPE = PERCENTAGE,
REJECT_VALUE = 100,
REJECT_SAMPLE_VALUE = 1000
)
It will skip the first row when executing queries against 'testdata.testfile1'.

You have a few options:
Get the file headers removed permanently, because PolyBase isn't really meant to work with file headers.
Use Azure Data Factory, which does have options for skipping header rows when the file is in Blob storage.
Set the rejection options of the PolyBase table to try and ignore the header row, i.e. set REJECT_TYPE to VALUE and REJECT_VALUE to 1.
This is a bit hacky, as you don't have any control over whether the rejected row is actually the header row, but it works if you only have one header row and it is the only error in the file. Example below.
For a file called temp.csv with this content:
a,b,c
1,2,3
4,5,6
A command like this will work:
CREATE EXTERNAL TABLE dbo.mycsv (
colA INT NOT NULL,
colB INT NOT NULL,
colC INT NOT NULL
)
WITH (
DATA_SOURCE = eds_esra,
LOCATION = N'/temp.csv',
FILE_FORMAT = eff_csv,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1
)
GO
SELECT *
FROM dbo.mycsv
My results, with the header row rejected:
colA        colB        colC
----------- ----------- -----------
1           2           3
4           5           6
Set the datatypes of the external table to VARCHAR just for staging the data, then remove the header row when converting to an internal table using something like ISNUMERIC, e.g.
CREATE EXTERNAL TABLE dbo.mycsv2 (
colA VARCHAR(5) NOT NULL,
colB VARCHAR(5) NOT NULL,
colC VARCHAR(5) NOT NULL
)
WITH (
DATA_SOURCE = eds_esra,
LOCATION = N'/temp.csv',
FILE_FORMAT = eff_csv,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
)
GO
CREATE TABLE dbo.mycsv3
WITH (
CLUSTERED INDEX ( colA ),
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
colA,
colB,
colC
FROM dbo.mycsv2
WHERE ISNUMERIC( colA ) = 1
GO
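One caveat on the ISNUMERIC filter above: ISNUMERIC returns 1 for strings such as '$', '-', or '1e4', so it can let non-integer junk through. Where TRY_CAST is available (SQL Server 2012+; it is also supported in Azure SQL Data Warehouse), it is a stricter filter. A minimal sketch reusing the staging table from the example above:

```sql
-- TRY_CAST returns NULL instead of raising an error when the
-- string cannot be converted, so the header row a,b,c is
-- filtered out along with any other non-integer rows.
SELECT
    TRY_CAST(colA AS INT) AS colA,
    TRY_CAST(colB AS INT) AS colB,
    TRY_CAST(colC AS INT) AS colC
FROM dbo.mycsv2
WHERE TRY_CAST(colA AS INT) IS NOT NULL;
```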
HTH

Skip header rows on SQL Data Warehouse PolyBase load
Delimited text files are often created with a header row that contains the column names. These rows need to be excluded from the data set during the load. Azure SQL Data Warehouse users can now skip these rows by using the First_Row option in the delimited text file format for PolyBase loads.
The First_Row option defines the first row that is read in every file loaded. By setting the value to 2, you effectively skip the header row for all files.
For more information, see the documentation for the CREATE EXTERNAL FILE FORMAT statement.


How to use MERGE-statement with VARBINARY data

I'm stuck trying to figure out how to get one of the MERGE statements to work. See below code snippet:
DECLARE @PipelineRunID VARCHAR(100) = 'testestestestest'
MERGE [TGT].[AW_Production_Culture] as [Target]
USING [SRC].[AW_Production_Culture] as [Source]
ON [Target].[MD5Key] = [Source].[MD5Key]
WHEN MATCHED AND [Target].[MD5Others] != [Source].[MD5Others]
THEN UPDATE SET
[Target].[CultureID] = [Source].[CultureID]
,[Target].[ModifiedDate] = [Source].[ModifiedDate]
,[Target].[Name] = [Source].[Name]
,[Target].[MD5Others] = [Source].[MD5Others]
,[Target].[PipelineRunID] = @PipelineRunID
WHEN NOT MATCHED BY TARGET THEN
INSERT VALUES (
[Source].[AW_Production_CultureKey]
,[Source].[CultureID]
,[Source].[ModifiedDate]
,[Source].[Name]
,@PipelineRunID
,[Source].[MD5Key]
,[Source].[MD5Others]);
When I try and run this query I receive the following error:
Msg 257, Level 16, State 3, Line 16
Implicit conversion from data type varchar to varbinary is not allowed. Use the CONVERT function to run this query.
The only VARBINARY column types are MD5Key and MD5Others. As they are both linked to their corresponding columns I don't understand why my error message indicates there is a VARCHAR problem involved. Does anybody understand how and why I should use a CONVERT() function here?
Thanks!
--EDIT: Schema definitions
CREATE VIEW [SRC].[AW_Production_Culture]
WITH SCHEMABINDING
as
SELECT
CAST(CONCAT('',[CultureID]) as VARCHAR(100)) as [AW_Production_CultureKey]
,CAST(HASHBYTES('MD5',CONCAT('',[CultureID])) as VARBINARY(16)) as [MD5Key]
,CAST(HASHBYTES('MD5',CONCAT([ModifiedDate],'|',[Name])) as VARBINARY(16)) as [MD5Others]
,[CultureID],[ModifiedDate],[Name]
FROM
[SRC].[tbl_AW_Production_Culture]
CREATE TABLE [TGT].[AW_Production_Culture](
[AW_Production_CultureKey] [varchar](100) NOT NULL,
[CultureID] [nchar](6) NULL,
[ModifiedDate] [datetime] NULL,
[Name] [nvarchar](50) NULL,
[MD5Key] [varbinary](16) NOT NULL,
[MD5Others] [varbinary](16) NOT NULL,
[RecordValidFrom] [datetime2](7) GENERATED ALWAYS AS ROW START NOT NULL,
[RecordValidUntil] [datetime2](7) GENERATED ALWAYS AS ROW END NOT NULL,
[PipelineRunID] [varchar](36) NOT NULL,
PRIMARY KEY CLUSTERED
(
[MD5Key] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY],
PERIOD FOR SYSTEM_TIME ([RecordValidFrom], [RecordValidUntil])
) ON [PRIMARY]
WITH
(
SYSTEM_VERSIONING = ON ( HISTORY_TABLE = [TGT].[AW_Production_Culture_History] )
)
Reposting my comment as an answer for the sweet, sweet internet points:
You're getting that error because your varbinary value is being inserted into a varchar column. As your columns have the correct types already then it means your INSERT clause has mismatched columns.
As it is, your MERGE statement is not explicitly listing the destination columns - you should always explicitly list columns in production code so that your DML queries won't break if columns are added or reordered or marked HIDDEN.
So to fix this, change your INSERT clause to explicitly list destination column names.
Also, when using MERGE you should use HOLDLOCK (or a more suitable lock hint, if applicable), otherwise you'll run into concurrency issues. MERGE is not concurrency-safe by default!
Minor nit-picks that are largely subjective:
I personally prefer avoiding [escapedName] wherever possible and prefer using short table aliases.
e.g. use s and t instead of [Source] and [Target].
"Id" (for "identity" or "identifier") is an abbreviation, not an acronym - so it should be cased as Id and not ID.
Consider using an OUTPUT clause to help diagnose/debug issues too.
So I'd write it like so:
DECLARE @PipelineRunId VARCHAR(100) = 'testestestestest'
MERGE INTO
tgt.AW_Production_Culture WITH (HOLDLOCK) AS t
USING
src.AW_Production_Culture AS s ON t.MD5Key = s.MD5Key
WHEN MATCHED AND t.MD5Others != s.MD5Others THEN UPDATE SET
t.CultureId = s.CultureId,
t.ModifiedDate = s.ModifiedDate,
t.Name = s.Name,
t.MD5Others = s.MD5Others,
t.PipelineRunID = @PipelineRunId
WHEN NOT MATCHED BY TARGET THEN INSERT
(
AW_Production_CultureKey,
CultureId,
ModifiedDate,
[Name],
PipelineRunId,
MD5Key,
MD5Others
)
VALUES
(
s.AW_Production_CultureKey,
s.CultureId,
s.ModifiedDate,
s.[Name],
@PipelineRunId,
s.MD5Key,
s.MD5Others
)
OUTPUT
$action AS [Action],
inserted.*,
deleted.*;

Create table using text file bulk insert

I'm trying to create a table in SQL Server from a text file using BULK INSERT, but I keep getting a bulk load data conversion error (truncation). Is there something I'm doing wrong? The top part shows how the data looks in the text file, and below it is the code.
'01','INPATIENT FACILITY','010','ACUTE CARE HOSPITAL'
'01','INPATIENT FACILITY','011','PRIVATE PSYCHIATRIC HOSPITAL'
'01','INPATIENT FACILITY','012','INPATIENT MEDICAL REHAB HOSPITAL'
CREATE TABLE [dbo].[PROVIDER_TYPE]
(
[PROVIDER_TYPE_ID] [VARCHAR](2) NULL,
[PROVIDER_TYPE] [VARCHAR](50) NULL,
[PROVIDER_SPECIALITY_ID] [VARCHAR](3) NULL,
[PROVIDER_SPECIALITY] [VARCHAR](50) NULL
) ON [PRIMARY]
BULK INSERT DBO.PROVIDER_TYPE FROM 'C:\SQL\t2.txt'
WITH (
datafiletype = 'char'
,fieldterminator = ','
,ROWTERMINATOR = '\n'
)
The first value isn't 2 characters long, it's 4. The value is '01'; that's inclusive of the single quotes ('). This is why you're getting a truncation error, as '01' ('''01''' if you were to want to represent the string in T-SQL) doesn't fit in a varchar(2).
If you're on SQL Server 2017+ you can use the FORMAT and FIELDQUOTE options. Note that I also use \r\n for ROWTERMINATOR, as I had both when I created the file, if yours only contains a line break (and no carriage return), then just use \n:
BULK INSERT dbo.PROVIDER_TYPE FROM '/mnt/WDBlue/t2.txt' --This was my test file
WITH (DATAFILETYPE = 'char',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\r\n',
FORMAT = 'CSV',
FIELDQUOTE = '''');
If you aren't using SQL Server 2017+, then it simply does not support quoted fields, and I suggest using a different tool.
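If you are stuck on an older version, one possible workaround (a sketch of mine, not part of the answer above; the staging table name and column widths are illustrative) is to bulk load into a wider VARCHAR staging table and strip the quotes afterwards:

```sql
-- Staging table: wide enough to hold each value plus its quotes.
CREATE TABLE dbo.PROVIDER_TYPE_STG
(
    [PROVIDER_TYPE_ID] VARCHAR(10) NULL,
    [PROVIDER_TYPE] VARCHAR(60) NULL,
    [PROVIDER_SPECIALITY_ID] VARCHAR(10) NULL,
    [PROVIDER_SPECIALITY] VARCHAR(60) NULL
);

BULK INSERT dbo.PROVIDER_TYPE_STG FROM 'C:\SQL\t2.txt'
WITH (DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n');

-- Strip the surrounding single quotes while copying to the real
-- table. Note: REPLACE also removes any quotes embedded inside
-- a value, which is acceptable for data like this sample.
INSERT INTO dbo.PROVIDER_TYPE
SELECT REPLACE([PROVIDER_TYPE_ID], '''', ''),
       REPLACE([PROVIDER_TYPE], '''', ''),
       REPLACE([PROVIDER_SPECIALITY_ID], '''', ''),
       REPLACE([PROVIDER_SPECIALITY], '''', '')
FROM dbo.PROVIDER_TYPE_STG;
```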

Azure SQL DW External File Format treat empty strings as NULL using Polybase

I'm using external tables to load data from csv stored in a blob to a table in Azure SQL Data Warehouse. The csv uses a string delimiter (double quote), empty strings are represented as 2 double quotes ("").
I want the empty columns to be treated as NULL in the table. The external file format I use is set up with USE_TYPE_DEFAULT = FALSE, but this does not seem to work, since empty columns are imported as empty strings. This only seems to happen when the columns are strings; numeric columns are correctly converted to NULL.
I'm also importing a different csv which does not have a string delimiter, using a different external file format, and those empty columns are imported as NULL. So it looks like it has something to do with the STRING_DELIMITER option.
The csv:
col1;col2;col3;col4;col5;col6
"a";"b";"c";"1";"2";"3"
"d";"";"f";"4";"";"6"
The code of the external file format:
CREATE EXTERNAL FILE FORMAT eff_string_del
WITH (
FORMAT_TYPE = DELIMITEDTEXT
,FORMAT_OPTIONS(
FIELD_TERMINATOR = ';'
,STRING_DELIMITER = '0x22'
,FIRST_ROW = 2
,USE_TYPE_DEFAULT = False)
)
Code of the table using the external file format:
CREATE EXTERNAL TABLE dbo.test (
col1 varchar(1) null
,col2 varchar(1) null
,col3 varchar(1) null
,col4 int null
,col5 int null
,col6 int null
)
WITH (
DATA_SOURCE = [EDS]
,LOCATION = N'test.csv'
,FILE_FORMAT = eff_string_del
,REJECT_TYPE = VALUE
,REJECT_VALUE = 0
)
The result when querying the external table:
SELECT *
FROM [dbo].[test]
col1 col2 col3 col4 col5 col6
---- ---- ---- ----------- ----------- -----------
a b c 1 2 3
d f 4 NULL 6
Can someone please help me explain what is happening or what I'm doing wrong?
Use USE_TYPE_DEFAULT = False in the external file format. Note that any NULL values that are stored by using the word NULL in the delimited text file are imported as the string 'NULL'.
For example:
CREATE EXTERNAL FILE FORMAT example_file_format
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
FIRST_ROW = 2,
USE_TYPE_DEFAULT = False)
)
Reference : https://learn.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql?view=sql-server-2017
Have you considered adding the value NULL in that field instead of ""?
See below a test I've performed using the following code:
declare @mytable table
(id int identity primary key, column1 varchar(100))
insert into @mytable (column1) values ('test1')
insert into @mytable (column1) values ('test2')
insert into @mytable (column1) values (null)
insert into @mytable (column1) values ('test3')
insert into @mytable (column1) values (null)
select
*
from @mytable
The results look like this:
id          column1
----------- -------
1           test1
2           test2
3           NULL
4           test3
5           NULL
Would this work for you?

How can I do a merge of two tables in SQL Server?

I have two tables with schemas like this:
CREATE TABLE [dbo].[WordsA] (
[WordId] INT IDENTITY (1, 1) NOT NULL,
[Word] NVARCHAR (MAX) NOT NULL,
[FromWordA] BIT NULL,
[FromWordB] BIT NULL
);
CREATE TABLE [dbo].[WordsB] (
[WordId] INT IDENTITY (1, 1) NOT NULL,
[Word] NVARCHAR (MAX) NOT NULL,
);
How can I take the contents of table WordsB and insert into WordsA row by row:
If Word does not exist in WordsA
Insert into WordsA and set FromWordB = 1
If Word exists in WordsA
Update WordsA setting FromWordB = 1
You need MERGE:
MERGE [dbo].[WordsA] as target
USING [dbo].[WordsB] as source
ON target.[Word] = source.[Word]
WHEN MATCHED THEN
UPDATE SET [FromWordB] = 1
WHEN NOT MATCHED THEN
INSERT ([Word],[FromWordA],[FromWordb])
VALUES (source.[Word],0,1);
Give this a try (not tested). Note that the join needs to be on Word rather than WordId, since the identity values won't line up between the two tables, and INSERT is only allowed in the plain WHEN NOT MATCHED (i.e. not matched by target) branch:
MERGE WordsA A
USING WordsB B
ON A.Word = B.Word
WHEN NOT MATCHED THEN
INSERT (Word, FromWordB)
VALUES (B.Word, 1)
WHEN MATCHED THEN
UPDATE SET FromWordB = 1
;
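If you prefer to avoid MERGE altogether, the same logic can also be written as two plain statements; a sketch using the tables from the question:

```sql
-- Mark words that already exist in WordsA as also coming from WordsB.
UPDATE a
SET a.FromWordB = 1
FROM dbo.WordsA AS a
INNER JOIN dbo.WordsB AS b
    ON a.Word = b.Word;

-- Insert words that exist only in WordsB.
INSERT INTO dbo.WordsA (Word, FromWordA, FromWordB)
SELECT b.Word, 0, 1
FROM dbo.WordsB AS b
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.WordsA AS a WHERE a.Word = b.Word
);
```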

I want to merge two tables without any primary keys

We cannot create any additional column
Please keep that in mind
The whole intention of this script is to merge data into my temp table
when the data is matching don't have to do any thing.
if some data is present in #temp_cqm_class_template_xref and not in cqm_class_template_xref_temp then those data's has to be deleted from #temp_cqm_class_template_xref table
if it is the other way it has to be inserted into the #temp_cqm_class_template_xref table
IF OBJECT_ID('tempdb..#temp_cqm_class_template_xref') IS NOT NULL
DROP TABLE #temp_cqm_class_template_xref;
CREATE TABLE #temp_cqm_class_template_xref (
[template_name] [VARCHAR](30) NOT NULL
,[measure_id] [INT] NOT NULL
,[cqm_item_mstr_id] [INT] NOT NULL
,[created_by] [INT] NOT NULL
,[modified_by] [INT] NOT NULL
,[create_timestamp] [DATETIME] NOT NULL
,[modify_timestamp] [DATETIME] NOT NULL
);
MERGE INTO #temp_cqm_class_template_xref AS t
USING cqm_class_template_xref_temp AS s
ON (
t.template_name = s.template_name
AND t.measure_id = s.measure_id
AND t.cqm_item_mstr_id = s.cqm_item_mstr_id
)
WHEN NOT MATCHED
THEN
INSERT (
template_name
,measure_id
,cqm_item_mstr_id
,created_by
,modified_by
,create_timestamp
,modify_timestamp
)
VALUES (
s.template_name
,s.measure_id
,s.cqm_item_mstr_id
,s.created_by
,s.modified_by
,s.create_timestamp
,s.modify_timestamp
)
WHEN NOT MATCHED BY target
THEN
DELETE;
When I run this script, I get the following error:
Msg 10711, Level 15, State 1, Procedure ngdev_cqm_class_template_xref_bcp_upld, Line 88
An action of type 'INSERT' is not allowed in the 'WHEN MATCHED' clause of a MERGE statement
Merge is not a good technique to use; see:
https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/
Merge is hard to debug and very hard to maintain later when you have problems with the data that it is trying to merge. Don't ever use it.
Instead, write an INSERT using a SELECT instead of a VALUES clause, and write a DELETE.
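That advice might look like the following sketch (untested, using the table names from the question):

```sql
-- Delete rows that are in the temp table but not in the source.
DELETE t
FROM #temp_cqm_class_template_xref AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM cqm_class_template_xref_temp AS s
    WHERE s.template_name = t.template_name
      AND s.measure_id = t.measure_id
      AND s.cqm_item_mstr_id = t.cqm_item_mstr_id
);

-- Insert rows that are in the source but not in the temp table.
INSERT INTO #temp_cqm_class_template_xref (
    template_name, measure_id, cqm_item_mstr_id,
    created_by, modified_by, create_timestamp, modify_timestamp
)
SELECT
    s.template_name, s.measure_id, s.cqm_item_mstr_id,
    s.created_by, s.modified_by, s.create_timestamp, s.modify_timestamp
FROM cqm_class_template_xref_temp AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM #temp_cqm_class_template_xref AS t
    WHERE t.template_name = s.template_name
      AND t.measure_id = s.measure_id
      AND t.cqm_item_mstr_id = s.cqm_item_mstr_id
);
```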
You may try a left join and delete the rows that have no match in the opposite table. According to your logic, this might help you (note that the temp table's columns are all NOT NULL, so deleting the unmatched rows directly avoids any intermediate update):
delete t
from #temp_cqm_class_template_xref t
left join cqm_class_template_xref_temp s on
t.template_name = s.template_name
AND t.measure_id = s.measure_id
AND t.cqm_item_mstr_id = s.cqm_item_mstr_id
where s.template_name is null
