Polybase to connect local CSV file - sql-server

I'm unable to access a local CSV file from SQL Server 2019 PolyBase. It is a simple three-column text file. I have also created a local system DSN (from the ODBC32 UI).
I got the sample code from here. However, the driver in the link (CData) is not free. Any assistance in solving this issue will be greatly appreciated.
create master key encryption by password = 'Polybase2CSV';
create database scoped credential csv_creds
with identity = 'username', secret = 'password';
create external data source csv_source
with (
location = 'odbc://localhost',
connection_options = 'DSN=CustomerDSN', -- this is the DSN name
-- PUSHDOWN = ON | OFF,
credential = csv_creds
);
CREATE EXTERNAL TABLE Customer
(
CUSTOMERID int,
CUSTOMERNAME varchar(250),
DEPARTMENT varchar(250)
) WITH (
LOCATION='customer.txt',
DATA_SOURCE=csv_source
);

This requires a few steps to make it work successfully. As a prerequisite, make sure SQL Server 2019 has been updated to CU4 (KB4548597), which fixes a few known issues. For a free solution, install the 64-bit version of the Microsoft Access Database Engine 2016 Redistributable; this provides the 64-bit ODBC drivers.
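Before creating any of these objects, it can also help to confirm that the PolyBase feature is actually installed and enabled on the instance; for example:
-- Returns 1 if the PolyBase feature is installed on this instance
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
-- Enable PolyBase if it is installed but not yet enabled
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;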
With these two things in place, you can now create the external data source. I recommend disabling PUSHDOWN. I've seen it cause some problems with this particular driver.
If you want to directly connect to the CSV file that contains a header row, you can create the external data source by simply specifying the Access Text Driver and the folder that will contain the files:
CREATE EXTERNAL DATA SOURCE MyODBC
WITH
(
LOCATION = 'odbc://localhost',
CONNECTION_OPTIONS = 'Driver=Microsoft Access Text Driver (*.txt, *.csv);Dbq=F:\data\files\',
PUSHDOWN = OFF
);
To use the data source, you need to create an external table definition that reflects the file format. The LOCATION parameter is the name of the file to load. You can wrap the file name and driver name in square brackets to avoid issues with special characters. It's important that the column names you define for this table match the names in the header row. Because you're on CU4, if a data type doesn't match the driver's expectations, you'll get an error indicating which data types were expected.
CREATE EXTERNAL TABLE dbo.CsvData
(
Name nvarchar(128),
Count int,
Description nvarchar(255)
)
WITH
(
LOCATION='[filename.csv]',
DATA_SOURCE = [MyODBC]
)
If you want to define the column names, data types, etc., in the ODBC Data Sources (64-bit) UI, choose the Microsoft Access Text Driver. You can then select the folder, file types, and definition of the text file format. Make sure to use the 64-bit data sources. Once you're done defining the format details, a schema.ini file containing those details is created in the folder.
For the external data source, you'll specify the name of the DSN:
CREATE EXTERNAL DATA SOURCE MyODBC
WITH
(
LOCATION = 'odbc://localhost',
CONNECTION_OPTIONS = 'DSN=LocalCSV',
PUSHDOWN = OFF
);
The EXTERNAL TABLE is created the same way as before, with the column names and data types matching the definition you declared in the DSN.
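For example, against the customer file from the original question, the external table over the DSN-based source might look like this (the column names and types are assumed to match whatever you defined in the DSN):
CREATE EXTERNAL TABLE dbo.CustomerFromDsn
(
CUSTOMERID int,
CUSTOMERNAME varchar(250),
DEPARTMENT varchar(250)
)
WITH
(
LOCATION='[customer.txt]',
DATA_SOURCE = [MyODBC]
);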

To create a data source that connects directly, you'd need to buy that driver. That was option 1, but since it's off the table, you have two more options: import the data directly into SQL Server, or, if you really want to use PolyBase, load the data into staging SQL tables and then create external tables referencing those staging tables.
My assumptions: the CSV data isn't stale, and the structure/schema will remain constant.
Create a staging table, then use dbatools' Import-DbaCsv to load it:
Import-DbaCsv -Path "D:\CustomerTest\Customer.csv" `
    -SqlInstance ServerName `
    -Database DBName `
    -Table "Customer"
Then reuse your earlier code to either connect through PolyBase or use the data directly.
CREATE DATABASE SCOPED CREDENTIAL csv_creds
WITH IDENTITY = 'username', SECRET = 'password';
CREATE EXTERNAL DATA SOURCE csv_source
WITH ( LOCATION = 'sqlserver://SERVERNAME:PORTNUMBER',
PUSHDOWN = ON,
CREDENTIAL = csv_creds);
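Assuming the staging table keeps the three columns from the original question, the external table over it could look something like this (the three-part name in LOCATION is illustrative):
CREATE EXTERNAL TABLE dbo.Customer_External
(
CUSTOMERID int,
CUSTOMERNAME varchar(250),
DEPARTMENT varchar(250)
)
WITH (
LOCATION = 'DBName.dbo.Customer', -- database.schema.table on the instance behind csv_source
DATA_SOURCE = csv_source
);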
You can then periodically run the PS function to load data into the table as needed.

Related

Azure Data Factory: Return Identifier values from a Copy Data activity

I am updating an on-premises SQL Server database table with data from a csv file using a Copy Data activity. There is an int identity Id column on the sink table that gets generated when I do the Upsert. I would like to retrieve the Id value generated in that table to use later in the pipeline.
Is there a way to do this?
I can't use a data flow as I am using a self-hosted Integration Runtime.
Hi @Nick.McDermaid, I am loading about 7,000 rows from the file to the database. I want to store the identities in the database the file comes from.
Edit:
I have 2 databases (source/target). I want to upsert (using the MERGE SQL below, with the OUTPUT clause) into the target db from the source db and then return the Ids (via the OUTPUT resultset) to the source db. The problem I have is that the upsert (MERGE) SQL gets its SELECT statement from the same target db that the target table is in (when using a Copy Data activity), but I need to get the SELECT from the source db. Is there a way to do this, maybe using the Script activity?
Edit 2: To clarify, the 2 databases are on different servers.
Edit 3 (MERGE Update):
MERGE Product AS target
USING (SELECT [epProductDescription]
,[epProductPrimaryReference]
FROM [epProduct]
WHERE [epEndpointId] = '438E5150-8B7C-493C-9E79-AF4E990DEA04') AS source
ON target.[Sku] = source.[epProductPrimaryReference]
WHEN MATCHED THEN
UPDATE SET [Name] = source.[epProductDescription]
,[Sku] = source.[epProductPrimaryReference]
WHEN NOT MATCHED THEN
INSERT ([Name]
,[Sku])
VALUES (source.[epProductDescription]
,source.[epProductPrimaryReference])
OUTPUT $action, inserted.*;
Edit 3 (sample data): the source sample and target output were attached as screenshots.
Is there a way to do this, maybe using the Script activity?
Yes, you can execute this script using the Script activity in ADF.
As your tables are on different SQL Servers, you first have to create a linked server to the source database on the target database.
Go to Server Objects >> Linked Servers >> New Linked Server and create a linked server to the source database on the target database as below.
While creating the linked server, make sure the same user exists on both databases.
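If you prefer T-SQL to the UI, the linked server can also be created with the system procedures; the server name, data source, provider, and login below are placeholders:
-- Create a linked server named OP3 pointing at the source SQL Server instance
-- (the provider depends on which OLE DB driver is installed)
EXEC sp_addlinkedserver @server = N'OP3', @srvproduct = N'', @provider = N'MSOLEDBSQL', @datasrc = N'SourceServerName';
-- Map a login that exists on both servers
EXEC sp_addlinkedsrvlogin @rmtsrvname = N'OP3', @useself = N'FALSE', @locallogin = NULL, @rmtuser = N'SqlUser', @rmtpassword = N'StrongPassword';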
Then I wrote the MERGE query using this linked server as the source.
My Sample Query:
MERGE INTO PersonsTarget as trg
USING (SELECT [LastName],[FirstName],[State]
FROM [OP3].[sample1].[dbo].[Personssource]) AS src
ON trg.[State] = src.[State]
WHEN MATCHED THEN
UPDATE SET [LastName] = src.[LastName]
,[FirstName] = src.[FirstName]
WHEN NOT MATCHED THEN
INSERT ([LastName],[FirstName],[State])
VALUES (src.[LastName],src.[FirstName],src.[State])
OUTPUT $action, inserted.*;
Then, in the Script activity, I provided the script.
Note: in the linked service for the on-premises target table, use the same user that you used for the linked server.
It executed successfully and returned the Ids:

SQL Server Polybase and Oracle Exadata HA/DR connection strings

Working with SQL Server 2019 Enterprise CU18 Polybase (On-premise). I have a working connection from SQL Server to Oracle that is defined as follows:
CREATE EXTERNAL DATA SOURCE [OracleDataSource]
WITH (LOCATION = 'oracle://ExaData1:1521', CREDENTIAL = [OracleCredential])
However, we have an HA/DR pair for our Exadata system: ExaData1 and ExaData2. If we did this in a linked server, the connection would look like this:
(DESCRIPTION=(ENABLE=BROKEN)(ADDRESS_LIST=(LOAD_BALANCE=YES)(FAILOVER=YES)(ADDRESS=(PROTOCOL=tcp)(HOST=ExaData1.domain.com)(PORT=1521))(ADDRESS=(PROTOCOL=tcp)(HOST=ExaData2.domain.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=OracleTNSName)))
Yet, I can't seem to define it this way in PolyBase (at least I haven't figured out how yet). I need to figure out how to list both possible servers in the external data source so that when the database fails over, the connection follows it naturally.
The other option would be to use the TNS name rather than the server name for the location, but I haven't figured that out either. Currently, when I define a table, it looks like this, based on the previously defined data source, listing the TNS name as part of the external table's LOCATION:
CREATE EXTERNAL TABLE [Polybase].[sample]
(
[SID] [nvarchar](8) NULL,
[MESSAGE_DATE] [datetime2](7) NULL,
[MESSAGE_ID] [nvarchar](3) NULL
)
WITH (DATA_SOURCE = [OracleDataSource],LOCATION = N'[OracleTNSNAME.domain.com].OracleSchema.SAMPLE')
Anyone have suggestions or options? I'm not finding anything in the MS documentation.
Appreciate any and all support.
Have you tried this?
oracle://server1:1521; AlternateServers=(server2:1521,server3:1521,server4:1521); LoadBalancing=true
The only other option is to connect by using the ExaData SCAN name (configured by the Oracle database administrator; it resolves to any of the nodes using a single name). On the Oracle side it looks like this:
(DESCRIPTION =
  (CONNECT_TIMEOUT=90)(RETRY_COUNT=20)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)
  (ADDRESS=(PROTOCOL=TCP)(HOST=sales-scan.mycluster.example.com)(PORT=1521))
  (CONNECT_DATA=(SERVICE_NAME=oltp.example.com)))
Then from SQL Server you connect to sales-scan.mycluster.example.com
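In the same shape as the data source from the question, that would look something like this (using the example SCAN host above):
CREATE EXTERNAL DATA SOURCE [OracleDataSourceScan]
WITH (LOCATION = 'oracle://sales-scan.mycluster.example.com:1521', CREDENTIAL = [OracleCredential]);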

Easy way to load a CSV file from the command line into a new table of an Oracle database without specifying the column details

I often want to quickly load a CSV into an Oracle database. The CSV (Unicode) is on a machine with an Oracle InstantClient version 19.5, the Oracle database is of version 18c.
I look for a command line tool which uploads the rows without me specifying a column structure.
I know I can use sqlldr with a .ctl file, but then I need to define column types, etc. I am interested in a tool that figures out the column attributes itself from the data in the CSV (or uses a generic default for all columns).
The CSVs I have to ingest always contain a header row, which the tool in question could use to determine appropriate columns for the table.
Starting with Oracle 12c, you can use sqlldr in express mode, so you don't need any control file.
In Oracle Database 12c onwards, SQL*Loader has a new feature called express mode that makes loading CSV files faster and easier. With express mode, there is no need to write a control file for most CSV files you load. Instead, you can load the CSV file with just a few parameters on the SQL*Loader command line.
An example
Imagine I have a table like this
CREATE TABLE EMP
(EMPNO number(4) not null,
ENAME varchar2(10),
HIREDATE date,
DEPTNO number(2));
Then a csv file that looks like this
7782,Clark,09-Jun-81,10
7839,King,17-Nov-81,12
I can use sqlldr in express mode:
sqlldr userid=xxx table=emp
You can read more about express mode in this white paper
Express Mode in SQLLDR
Forget about using sqlldr in a script file. Your best bet is to use an external table. This is a CREATE TABLE statement with SQL*Loader-style access parameters that reads a file from a directory and exposes it as a table. Super easy, really convenient.
Here is an example:
create table thisTable (
"field1" varchar2(10)
,"field2" varchar2(100)
,"field3" varchar2(100)
,"dateField" date
) organization external (
type oracle_loader
default directory <createDirectoryWithYourPath>
access parameters (
records delimited by newline
load when ("field1" != blanks)
skip 9
fields terminated by ',' optionally ENCLOSED BY '"' ltrim
missing field values are null
(
"field1"
,"field2"
,"field3"
,"dateField" date 'mm/dd/yyyy'
)
)
location ('filename.csv')
);
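For completeness, here is a sketch of creating the directory object the external table points at, plus an optional copy into an ordinary table; the directory name, path, and grantee are placeholders:
-- Run as a user with CREATE ANY DIRECTORY, or have a DBA do it
CREATE OR REPLACE DIRECTORY csv_dir AS '/data/csv';
GRANT READ, WRITE ON DIRECTORY csv_dir TO your_schema;
-- Optionally materialize the external table into a regular table
CREATE TABLE thisTable_copy AS SELECT * FROM thisTable;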

Can I make sql server management studio not require [] around table names?

If I copy and paste a query into SQL Server Management Studio to debug it, I have to change all the table names from tableName to [database].[dbo].[tableName]. Can this be avoided?
It also matters which database you are using. When you open a default query window, it selects master as your database. You can either manually change it to your database, after which plain table names will resolve, or specify it in your query with:
Use Databasename;
Otherwise, you will need to specify the databasename.schema.table name with every reference. This is also how you can query across multiple databases in the same query window.
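A short illustration (the database and column names here are made up):
USE MyDatabase;
-- Resolves against the current database and default schema
SELECT t.Col1 FROM tableName AS t;
-- Cross-database reference in the same query window
SELECT o.Col1 FROM OtherDatabase.dbo.OtherTable AS o;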
[] is the quoted-identifier delimiter (see the QUOTENAME() function) and is only required when you don't have a valid identifier for an object.
For Example
this fails
create table dbo.123
(
id int
)
this succeeds
create table dbo.[123]
(
id int
)
So, in summary, [] is not necessary if you have a valid identifier, and is required when you don't have one.
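For reference, T-SQL also exposes this bracketing through the QUOTENAME() function, which is handy when building dynamic SQL:
SELECT QUOTENAME('123');           -- returns [123]
SELECT QUOTENAME('Order Details'); -- returns [Order Details]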

How to pass an Excel file from a WinForms client to a WCF service and into an SQL Server table?

Can anyone provide any guidance, code, or advice?
I need a WCF service contract and implementation that will take an Excel file as a parameter. The contract implementation should insert that Excel file into a varbinary(MAX) column in SQL Server.
Here is a post that addresses the WCF portion of your question.
Once you get the file to the server you can use FileHelpers.net to parse that file into an object.
I don't care what your requirement says, do not insert an excel document into a sql server varbinary(max) column. The requirement should say "Upload an Excel document, insert its contents into a database. However, we need to relate the original excel file to the data within the database so we can repudiate any claims that our process failed as well as have a verification mechanism."
Create another table called EXCEL_IMPORT or something that has more or less the following columns. Check the extended properties I put on there for column clarifications.
create table EXCEL_IMPORT
(
ImportID int identity(1,1) NOT NULL CONSTRAINT [PK_EXCEL_IMPORT] PRIMARY KEY,
FileName_Incoming varchar(max),
FilePath_Internal varchar(max),
FileName_Internal varchar(max),
FileRowCount int NOT NULL CONSTRAINT [CK_EXCEL_IMPORT_FileRowCount] CHECK (FileRowCount >= 0),
ImportDate datetime NOT NULL CONSTRAINT [DF_EXCEL_IMPORT_ImportDate] DEFAULT(getdate())
)
EXEC sys.sp_addextendedproperty @name=N'MS_Description', @value=N'The location on the client computer i.e. C:\Users\jimmy\Desktop\iHeartExcel.xls' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'EXCEL_IMPORT', @level2type=N'COLUMN',@level2name=N'FileName_Incoming'
EXEC sys.sp_addextendedproperty @name=N'MS_Description', @value=N'The folder on your fileshare the file is in (this is incase you decide to change the fileshare name)' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'EXCEL_IMPORT', @level2type=N'COLUMN',@level2name=N'FilePath_Internal'
EXEC sys.sp_addextendedproperty @name=N'MS_Description', @value=N'The unique filename that you decided on in the fileshare i.e. 2012_04_20_11_34_59_0_71f452e7-7cac-4afe-b145-6b7557f34263.xls' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'EXCEL_IMPORT', @level2type=N'COLUMN',@level2name=N'FileName_Internal'
Next, write a process that copies the Excel file to a location on your fileshare and creates a unique filename. At this point you have an object that represents the file, the file itself, and all the information about where you are putting the file.
Create a table that mimics the columns of the Excel spreadsheet and add an ImportID column on the end of it that references back to the EXCEL_IMPORT table defined above, as well as an identity primary key.
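As a purely hypothetical illustration (the column names are invented for a fixed two-column spreadsheet), that data table might look like:
create table EXCEL_IMPORT_DATA
(
RowID int identity(1,1) NOT NULL CONSTRAINT [PK_EXCEL_IMPORT_DATA] PRIMARY KEY,
CustomerName varchar(250),
Department varchar(250),
ImportID int NOT NULL CONSTRAINT [FK_EXCEL_IMPORT_DATA_ImportID] REFERENCES EXCEL_IMPORT (ImportID)
)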
After that, write a process that inserts the objects into the database with the specified relationships. This will vary based on your current setup.
Finally you should have a keyed table with all the excel data in it that references a row in an import table that points to a file on disk.
Some other thoughts
I would think about not allowing the Excel data within the table to be modified; copy it in a calculated form to another table. Sometimes this doesn't make sense because of the volume of data, but what you are doing here is building a provable chain of data back to the original source, and sometimes it makes sense to have an untainted file copy as well as a SQL copy.
The first response you're going to have is "But all the excel files are different!" If that is the case, you just create the import table that points to a file on disk (assuming they are supposed to be different). If they are supposed to be the same, you just need more error checking.
Saving the binary file within the database is going to have consequences, namely the backup size, and SQL can't really index those kinds of columns. Your traversals of that table will also get slower with every file insert, and as a general rule you don't want that. (You can't do any more or less with the file on disk than you can with the binary column.)
Use a GUID with a prepended date in your filename on the share. You will never hunt through those files by hand anyway, and if you need to find them by date you can use the filename. This makes the names globally unique in case other processes need to write here as well.
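If you happened to build that name in SQL rather than in your application code (purely illustrative; either works), it could look like:
-- Produces something like 2012_04_20_11_34_59_71f452e7-7cac-4afe-b145-6b7557f34263.xls
SELECT FORMAT(GETDATE(), 'yyyy_MM_dd_HH_mm_ss') + '_' + CAST(NEWID() AS varchar(36)) + '.xls' AS UniqueFileName;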
I know this isn't what you asked for, but I have been down this path before with terrible consequences. I've imported millions of files with the method I described above and had no major process issues.
Speak up when requirements are unfeasible. Suggest alternatives that do the same thing cheaper or easier (use words like testable/scalable).
I'm sure the experts out there can improve upon this, but here are the basics ...
On the Server
1a. Add a new OperationContract to your Interface (eg. IService.cs)
[OperationContract]
string UploadBinaryFile(byte[] aByteArray);
1b. Insert into SQL Server table, in your contract Implementation (eg. Service.cs)
public string UploadBinaryFile(byte[] aByteArray)
{
try
{
using (SqlConnection conn = new SqlConnection(MyConnectionString)) // from saved connection string
{
conn.Open();
using (SqlCommand cmd = new SqlCommand("INSERT INTO MyTestTable(BinaryFile) VALUES (@binaryfile)", conn))
{
cmd.Parameters.Add("@binaryfile", SqlDbType.VarBinary, -1).Value = aByteArray;
cmd.ExecuteNonQuery();
}
}
return "1"; // to me, this indicates success
}
catch (Exception ex)
{
return "0: " + ex.Message; // return an error indicator + ex.message
}
}
On the Client
2a. Use the OpenFileDialog component to browse for files on your filesystem using the standard dialogue box that's used by most Windows applications.
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
txtUploadFilePath.Text = openFileDialog1.FileName;
}
2b. Load the file's contents into a byte array
byte[] BinaryFile = System.IO.File.ReadAllBytes(txtUploadFilePath.Text);
2c. Call your WCF contract, passing in the byte array
string UploadResponse = client.UploadBinaryFile(BinaryFile);
It's working... YAY :-)
