Polybase unable to place a WHERE condition on XLSX destination - sql-server

The WHERE clause in my T-SQL query returns no rows when querying an .XLSX file through SQL Server 2019 PolyBase.
Here's the code I used to set up PolyBase:
create master key encryption by password = 'Polybase2CSV';
create external data source myODBCxlsx with
(
LOCATION = 'odbc://localhost',
CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=F:\PolybaseSourceData\CustomerData.xlsx'
);
create external table CustomerData(
CUSTOMERID FLOAT(53),
CUSTOMERNAME Nvarchar(255),
DEPARTMENT Nvarchar(255)
) with (
LOCATION='[sheet1$]',
DATA_SOURCE=myODBCxlsx
);
This query works:
select * from customerData
However this doesn't:
select * from customerData where customername = 'Steve'
The query doesn't return any rows, although there is a customer named Steve.

PUSHDOWN is enabled by default if you don't specify a setting when creating an external data source. Unfortunately, that particular Excel driver doesn't work with pushdown enabled, so even simple filtered queries fail or return no rows. Turning off PUSHDOWN resolves that.
The external data source definition should look like this:
create external data source myODBCxlsx with
(
LOCATION = 'odbc://localhost',
CONNECTION_OPTIONS = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}; DBQ=F:\PolybaseSourceData\CustomerData.xlsx',
PUSHDOWN=OFF
);
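Since the external table created earlier depends on the old data source, the simplest path is usually to drop and recreate both before retrying the query; a minimal sketch, reusing the object names from the question:
-- Drop the dependent external table first, then the old data source.
DROP EXTERNAL TABLE CustomerData;
DROP EXTERNAL DATA SOURCE myODBCxlsx;
-- Recreate the data source with PUSHDOWN = OFF (as above) and the external table,
-- then the filtered query should return the expected row:
SELECT * FROM CustomerData WHERE CUSTOMERNAME = 'Steve';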

Related

Getting the error 'The Remote Java Bridge has not been attached yet.' while connecting to HDFS from an external table in SQL Server

I tried to create an external table in SQL Server pointing to HDFS, but I am getting the error below:
Msg 110813, Level 16, State 1, Line 16
105019;External file access failed due to internal error: 'The Remote Java Bridge has not been attached yet.'
I have configured Hadoop and SQL Server on Ubuntu 20.04 and installed PolyBase as well.
SQL Server version: 2019
Hadoop version: 3.3.0
Below are the queries I have executed.
CREATE EXTERNAL DATA SOURCE [HadoopDFS1]
WITH (
TYPE = Hadoop,
LOCATION = N'hdfs://localhost:9000'
)
CREATE EXTERNAL FILE FORMAT CSVFF WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR =',',
USE_TYPE_DEFAULT = TRUE));
CREATE EXTERNAL TABLE [dbo].[Salary] (
[Company Name] nvarchar(200),
[Job Title] nvarchar(100),
[Salaries Reported] int,
[Location] nvarchar(50)
)
WITH (LOCATION='/Data/input/',
DATA_SOURCE = HadoopDFS1,
FILE_FORMAT = CSVFF
);

SQL Data Warehouse External Table with String fields

I am unable to find a way to create an external table in Azure SQL Data Warehouse (Synapse SQL Pool) with PolyBase when some fields contain embedded commas.
For a csv file with 4 columns as below:
myresourcename,
myresourcelocation,
"""resourceVersion"": ""windows"",""deployedBy"": ""john"",""project_name"": ""test_project""",
"{ ""ResourceType"": ""Network"", ""programName"": ""v1""}"
I tried the following CREATE EXTERNAL FILE FORMAT and CREATE EXTERNAL TABLE statements.
CREATE EXTERNAL FILE FORMAT my_format
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR=',',
STRING_DELIMITER='"',
First_Row = 2
)
);
CREATE EXTERNAL TABLE my_external_table
(
resourceName VARCHAR,
resourceLocation VARCHAR,
resourceTags VARCHAR,
resourceDetails VARCHAR
)
WITH (
LOCATION = 'my/location/',
DATA_SOURCE = my_source,
FILE_FORMAT = my_format
)
But querying this table gives the following error:
Failed to execute query. Error: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Too many columns in the line.
Any help will be appreciated.
Currently this is not supported in PolyBase; you need to modify the input data accordingly to get it working.
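One possible workaround (my suggestion, not part of the original answer): re-export the source file with a field delimiter that never appears in the data, for example a pipe, and define the external file format around it:
-- Hypothetical format for a pipe-delimited re-export of the same data.
CREATE EXTERNAL FILE FORMAT my_pipe_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',   -- a character that does not occur inside any field
        STRING_DELIMITER = '"',
        FIRST_ROW = 2
    )
);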

How to access data of one DB from another using Elastic Job?

I am trying to access data in one DB from another DB. For that I am using Elastic Job. Using Elastic Job I am able to create a table from one DB in another, but I am not able to access or transfer the data. I tried it using an External Data Source and an External Table.
I used the below code :
External Data Source
CREATE EXTERNAL DATA SOURCE RemoteReferenceData
WITH
(
TYPE=RDBMS,
LOCATION='myserver',
DATABASE_NAME='dbname',
CREDENTIAL= JobRun
);
CREATE EXTERNAL TABLE [tablename] (
[Id] int null,
[Name] nvarchar(max) null
)
WITH (
DATA_SOURCE = RemoteReferenceData,
SCHEMA_NAME = N'dbo',
OBJECT_NAME = N'mytablename'
);
Getting the error below:
Error retrieving data from server.dbname. The underlying error message received was: 'The server principal "JobUser" is not able to access the database "dbname" under the current security context. Cannot open database "dbname" requested by the login. The login failed. Login failed for user 'JobUser'.'
There are some errors in your statements:
the LOCATION value should be: LOCATION='[servername].database.windows.net'
Make sure, when you create the CREDENTIAL, that the username and password are the ones used to log in to the remote database. Authentication using Azure Active Directory with elastic queries is not currently supported.
The whole T-SQL code example should be like this:
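-- Note (my addition, not part of the original answer): CREATE DATABASE SCOPED CREDENTIAL
-- requires a database master key; if the database doesn't have one yet, create it first:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';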
CREATE DATABASE SCOPED CREDENTIAL ElasticDBQueryCred
WITH IDENTITY = 'Username',
SECRET = 'Password';
CREATE EXTERNAL DATA SOURCE MyElasticDBQueryDataSrc WITH
(TYPE = RDBMS,
LOCATION = '[servername].database.windows.net',
DATABASE_NAME = 'Mydatabase',
CREDENTIAL = ElasticDBQueryCred
);
CREATE EXTERNAL TABLE [dbo].[CustomerInformation]
( [CustomerID] [int] NOT NULL,
[CustomerName] [varchar](50) NOT NULL,
[Company] [varchar](50) NOT NULL)
WITH
( DATA_SOURCE = MyElasticDBQueryDataSrc)
SELECT * FROM CustomerInformation
The final SELECT is the code I use to query the table in Mydatabase from DB1.
For more details, see Get started with cross-database queries (vertical partitioning) (preview).
Hope this helps.

read file from Azure Blob Storage into Azure SQL Database

I have already tested this design using a local SQL Server Express set-up.
I uploaded several .json files to Azure Storage.
In SQL Database, I created an External Data source:
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
(TYPE = BLOB_STORAGE,
LOCATION = 'https://mydatafilestest.blob.core.windows.net/my_dir'
);
Then I tried to query the file using my External Data Source:
select *
from OPENROWSET
(BULK 'my_test_doc.json', DATA_SOURCE = 'MyAzureStorage', SINGLE_CLOB) as data
However, this failed with the error message "Cannot bulk load. The file "prod_EnvBlow.json" does not exist or you don't have file access rights."
Do I need to configure a DATABASE SCOPED CREDENTIAL to access the file storage, as described here?
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql
What else can anyone see that has gone wrong and I need to correct?
OPENROWSET is currently not supported on Azure SQL Database, as explained in this documentation page. You may use BULK INSERT to load the data into a temporary table and then query that table. See this page for documentation on BULK INSERT.
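A minimal sketch of that workaround, assuming the MyAzureStorage data source from the question and that the whole JSON document should land in a single row (the 0x0b row terminator is simply a byte that should not appear in the file; names are illustrative):
-- Staging table to hold the raw JSON text.
CREATE TABLE #JsonStaging (JsonDoc nvarchar(max));
-- Pull the blob through the external data source; the unusual row terminator keeps the
-- whole file in one row so it can be queried (or parsed with OPENJSON) afterwards.
BULK INSERT #JsonStaging
FROM 'my_test_doc.json'
WITH (DATA_SOURCE = 'MyAzureStorage', ROWTERMINATOR = '0x0b');
SELECT JsonDoc FROM #JsonStaging;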
Now that OPENROWSET is in public preview, the following works. Note that the credential is only needed if your blob is not public; I tried it on a private blob with the scoped credential option and it worked. Also note that if you are using a SAS key, make sure you delete the leading ? so the string starts with sv as shown below.
Make sure the blobcontainer/my_test_doc.json section specifies the correct path, e.g. container/file.
CREATE DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2017****************';
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://yourstorage.blob.core.windows.net',
CREDENTIAL= MyAzureBlobStorageCredential);
DECLARE @json varchar(max);
SELECT @json = BulkColumn FROM OPENROWSET(BULK 'blobcontainer/my_test_doc.json',
SINGLE_BLOB, DATA_SOURCE = 'MyAzureBlobStorage',
FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') as j;
select @json;
More detail provided in these docs

Send @Output table results to a flat file in SSIS package returning 0 rows

Hi and thank you for your help.
I have an SSIS package whose first step executes a SQL DELETE query that deletes rows from a table and sends the deleted rows to an @Output table variable. The next step tries to take the @Output table and send it to a flat file destination. When I ran the DELETE query in SQL Server Management Studio it successfully output the rows it deleted, but for some reason the flat file produced by the package ends up with 0 rows. Is there something I need to do to make the @Output table data accessible in the subsequent flat file destination component? Do I need to create a temp table instead?
Here is the query that outputs the deleted rows into the @Output table variable. I'd like to take the contents of @Output and send them to a flat file destination.
DECLARE @Output table
(PatientVisitID INT
,VisitNumber NVARCHAR(45)
,LastName NVARCHAR(45)
,FirstName NVARCHAR(45)
,MiddleName NVARCHAR(45)
,NamePrefix NVARCHAR(45)
,NameSuffix NVARCHAR(45)
,BirthDate NVARCHAR(45)
,MedicalRecordNumber NVARCHAR(45)
,Gender NVARCHAR(1)
,AdmitState NVARCHAR(45)
,AdmitDateTime NVARCHAR(45)
,DischargeDateTime NVARCHAR(45)
,SSN NVARCHAR(12)
,PatientType NVARCHAR(45)
,HospitalService NVARCHAR(45)
,Location NVARCHAR(45)
,DischargeDisposition NVARCHAR(45)
)
DELETE
FROM PatientVisits
OUTPUT
DELETED.PatientVisitID
,DELETED.VisitNumber
,DELETED.LastName
,DELETED.FirstName
,DELETED.MiddleName
,DELETED.NamePrefix
,DELETED.NameSuffix
,DELETED.BirthDate
,DELETED.MedicalRecordNumber
,DELETED.Gender
,DELETED.AdmitState
,DELETED.AdmitDateTime
,DELETED.DischargeDateTime
,DELETED.SSN
,DELETED.PatientType
,DELETED.HospitalService
,DELETED.Location
,DELETED.DischargeDisposition
INTO @Output
where
CURRENT_TIMESTAMP - 33 > cast(convert(varchar,AdmitDateTime,101) as DATETIME)
AND PatientType NOT IN ('01','12')
SELECT * FROM @Output
You have something awry with your data and/or your query.
Consider the following simplified demo
IF NOT EXISTS
(
SELECT
*
FROM
sys.schemas AS S
INNER JOIN sys.tables AS T
ON S.schema_id = T.schema_id
WHERE
S.name = 'dbo'
AND T.name = 'so_36868244'
)
BEGIN
CREATE TABLE dbo.so_36868244
(
SSN nvarchar(12) NOT NULL
);
END
INSERT INTO
dbo.so_36868244
(
SSN
)
SELECT
D.SSN
FROM
(
VALUES
(N'111-22-3333')
, (N'222-33-4444')
, (N'222-33-4445')
, (N'222-33-4446')
) D(SSN)
LEFT OUTER JOIN
dbo.so_36868244 AS S
ON S.SSN = D.SSN
WHERE
S.SSN IS NULL;
We now have a table with a single column and 4 rows of data.
I used the following query, which uses the OUTPUT clause to push the DELETED data into a table variable and then selects from it:
DECLARE
@output table
(
SSN nvarchar(12) NOT NULL
);
DELETE TOP (2) S
OUTPUT
Deleted.SSN
INTO
@output ( SSN )
FROM
dbo.so_36868244 AS S
SELECT O.SSN FROM @output AS O;
Run that 3 times and you'll end up with 2 rows, 2 rows, and no rows. No problem, rerun the first query and you have 4 rows again - hooray for idempotent operations.
I used that query as the source for an OLE DB Source and then wrote the data to a flat file.
Reproduction
Biml, the business intelligence markup language, allows me to use a simplified XML dialect to describe an SSIS package. The following Biml, when fed through the Biml engine, will be translated into an SSIS package for whichever version of SQL Server you are working with.
Sound good? Go grab BimlExpress (it's free) and install it for your version of SSIS.
Once installed, under the BimlExpress menu select "Add New Biml File". Paste the following
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Connections>
<OleDbConnection Name="tempdb" ConnectionString="Data Source=localhost\dev2014;Initial Catalog=tempdb;Provider=SQLNCLI11.0;Integrated Security=SSPI;"/>
<FlatFileConnection FilePath="C:\ssisdata\so\output\so_36868244.txt" FileFormat="FFF so_36868244" Name="FFCM" />
</Connections>
<FileFormats>
<FlatFileFormat Name="FFF so_36868244" IsUnicode="false" ColumnNamesInFirstDataRow="true" FlatFileType="Delimited">
<Columns>
<Column Name="SSN" DataType="String" Length="12" Delimiter="CRLF" />
</Columns>
</FlatFileFormat>
</FileFormats>
<Packages>
<Package Name="so_36868244">
<Tasks>
<Dataflow Name="DFT Stuff">
<Transformations>
<OleDbSource ConnectionName="tempdb" Name="SQL Stuff">
<DirectInput><![CDATA[DECLARE
@output table
(
SSN nvarchar(12) NOT NULL
);
DELETE TOP (2) S
OUTPUT
Deleted.SSN
INTO
@output ( SSN )
FROM
dbo.so_36868244 AS S
SELECT O.SSN FROM @output AS O;]]></DirectInput>
</OleDbSource>
<DerivedColumns Name="DER Placeholder"></DerivedColumns>
<FlatFileDestination ConnectionName="FFCM" Name="FFDST Extract" Overwrite="true" />
</Transformations>
</Dataflow>
</Tasks>
</Package>
</Packages>
</Biml>
Edit lines 3 and 4 to be valid database connection strings (mine is using tempdb on a named instance of DEV2014) as well as point at a valid path on disk (mine is using C:\ssisdata\so\output)
Right click on the bimlscript.biml file and out pops a package named so_36868244 which should be able to run immediately and generate a flat file with contents like
SSN
111-22-3333
222-33-4444
What's wrong with your example
Without access to your systems and/or sample data, it's very hard to say.
I will give you some unsolicited advice, though, that will improve your development career. Avoid shorthand notation like CURRENT_TIMESTAMP - 33: it's unclear what the result will be, and it saves a negligible number of keystrokes compared to DATEADD(DAY, -33, CURRENT_TIMESTAMP).
There are also far more graceful mechanisms for dropping the time portion of a date than cast(convert(varchar, AdmitDateTime, 101) as DATETIME).
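For example, the predicate could be rewritten along these lines (a sketch of the advice above, assuming AdmitDateTime in PatientVisits is a datetime column and the intent is "admitted more than 33 days ago"):
-- Day-granularity comparison without round-tripping through varchar.
WHERE CAST(AdmitDateTime AS date) < DATEADD(DAY, -33, CAST(CURRENT_TIMESTAMP AS date))
  AND PatientType NOT IN ('01', '12')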
You may try with temp tables.
Put this statement in an Execute SQL Task, and set the connection manager's RetainSameConnection property to True (this makes sure the temp table is visible in other tasks):
IF OBJECT_ID('tempdb..##DeletedRows') IS NOT NULL
DROP TABLE ##DeletedRows
CREATE TABLE ##DeletedRows(EmpId TINYINT, EmpName VARCHAR(10))
DELETE
FROM dbo.Emp
OUTPUT
DELETED.EmpId,
DELETED.EmpName
INTO ##DeletedRows
Next, use a Data flow task and set the Data flow task's Delay Validation property to True. Drop an OLE DB Source task and a Flat File Destination.
For the first time only, run this statement in the database:
CREATE TABLE ##DeletedRows(EmpId TINYINT, EmpName VARCHAR(10))
In the OLE DB Source, use the sql statement
SELECT * FROM ##DeletedRows
and then map the columns to your flat file. We create the temp table in the database up front so that the columns can initially be mapped from the OLE DB Source to the Flat File Destination. Because Delay Validation is set to True, the temp table does not need to be created manually on subsequent runs.
You would need to make it a real (permanent) table. Table variables and temp tables created in one Execute SQL Task are not available in other Execute SQL Tasks.
You can always drop the permanent table when you are done with it.
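A minimal sketch of that pattern (dbo.DeletedPatientVisits is a name I made up, and the column list is trimmed for brevity):
-- Execute SQL Task 1: stage the deleted rows in a permanent table.
IF OBJECT_ID('dbo.DeletedPatientVisits') IS NOT NULL
    DROP TABLE dbo.DeletedPatientVisits;
CREATE TABLE dbo.DeletedPatientVisits (PatientVisitID INT, VisitNumber NVARCHAR(45));
DELETE FROM PatientVisits
OUTPUT DELETED.PatientVisitID, DELETED.VisitNumber
INTO dbo.DeletedPatientVisits (PatientVisitID, VisitNumber)
WHERE PatientType NOT IN ('01','12');
-- Data Flow: OLE DB Source = SELECT * FROM dbo.DeletedPatientVisits, then the Flat File Destination.
-- Execute SQL Task 2 (after the export): DROP TABLE dbo.DeletedPatientVisits;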
