Say you have your tables stored in a SQL Server database and you want to perform multi-table actions, i.e. join several tables from that same database.
The following code connects to SQL Server and retrieves data:
library(DBI)
library(dplyr)
library(odbc)
con <- dbConnect(odbc::odbc(),
.connection_string = "Driver={SQL Server};Server=.;Database=My_DB;")
Table1 <- tbl(con, "Table1")
Table1 # View glimpse of Table1
Table2 <- tbl(con, "Table2")
Table2 # View glimpse of Table2
Table3 <- tbl(con, "Table3")
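Because tbl() creates lazy references, the join itself can be composed with dplyr and executed on SQL Server before anything is pulled into R. A minimal sketch of the intended multi-table action, assuming (hypothetically) that Table1 and Table2 share an ID column:
joined <- Table1 %>%
  inner_join(Table2, by = "ID") %>%  # translated to SQL and run server-side
  collect()                          # bring only the joined result into R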
However, after a few result sets have been retrieved over the same connection, the following error eventually occurs:
Error: [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt
My googling so far has taken me to the answer that the backend (DBI and odbc) does not support multiple active result sets (MARS) - I guess only one active result set per connection is supported?
So, my question is: what is best practice if I want to collect data from several tables in a SQL Server database?
Open a connection for each table?
Actively close the connection and open it again for the next table?
Does the backend support MARS being passed in the connection string?
To make a connection that can hold multiple result sets, I've had luck with the following connection code:
con <- DBI::dbConnect(odbc::odbc(),
Driver = "SQL Server Native Client 11.0",
Server = "my_host",
UID = rstudioapi::askForPassword("Database UID"),
PWD = rstudioapi::askForPassword("Database PWD"),
Port = 1433,
MultipleActiveResultSets = "True",
Database = "my_db")
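With MultipleActiveResultSets enabled, several result sets can stay open on the one connection at the same time - a quick sketch of exactly the pattern that previously triggered the "connection is busy" error (table names hypothetical):
res1 <- DBI::dbSendQuery(con, "SELECT * FROM Table1")
res2 <- DBI::dbSendQuery(con, "SELECT * FROM Table2")  # no longer errors with MARS
DBI::dbFetch(res1, n = 10)
DBI::dbClearResult(res1)
DBI::dbClearResult(res2)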
On top of that, I found that the pool package can do the job:
library(pool)
pool <- dbPool(odbc::odbc(),
Driver = "SQL Server Native Client 11.0",
Server = "my_host",
UID = rstudioapi::askForPassword("Database UID"),
PWD = rstudioapi::askForPassword("Database PWD"),
Port = 1433,
MultipleActiveResultSets = "True",
Database = "my_db")
It is quicker and more stable than the plain DBI connection; one minor drawback is that the database doesn't show up in RStudio's Connections tab for easy reference.
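Queries can then be run against the pool just like against a single connection; recent versions of pool also support dplyr's tbl() directly on the pool object. A short sketch, reusing the hypothetical Table1 from above:
Table1 <- tbl(pool, "Table1")  # a connection is checked out and returned per query
Table1 %>% head(10) %>% collect()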
For both methods, remember to close the connection/pool when done. For the DBI method it's:
dbDisconnect(con)
Whereas the pool-method is closed by calling:
poolClose(pool)
I have a pandas DataFrame with 27 columns and ~45k rows that I need to insert into a SQL Server table.
I am currently using the code below, and it takes 90 minutes to insert:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};\
Server=#servername;\
Database=dbtest;\
Trusted_Connection=yes;')
cursor = conn.cursor() #Create cursor
for index, row in t6.iterrows():
cursor.execute("insert into dbtest.dbo.test (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, ..., col27)\
values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
row['col1'], row['col2'], row['col3'], ..., row['col27'])
I have also tried to load using executemany, and that takes even longer to complete, at nearly 120 minutes.
I am really looking for a faster load time since I need to run this daily.
You can set fast_executemany in pyodbc itself for versions >= 4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
connection.commit()  # pyodbc autocommit is off by default, so the insert must be committed
Or, if you prefer SQLAlchemy and writing DataFrames directly:
import sqlalchemy as db
# '@' separates the (empty) credentials from the host; spaces in the driver name are URL-encoded as '+'
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver.replace(' ', '+'), fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
I have worked through this in the past, and this was the fastest that I could get it to work using sqlalchemy.
import sqlalchemy as sa

# Windows authentication; server, database and driver_name are set up as shown below
engine = sa.create_engine(
    f"mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name.replace(' ', '+')}",
    fast_executemany=True)
df.to_sql('Daily_Report', con=engine, if_exists='append', index=False)
If the engine is not working for you, then you may have a different setup so please see: https://docs.sqlalchemy.org/en/13/core/engines.html
You should be able to create the variables needed above, but here is how I get the driver:
import pyodbc

driver_name = ''
driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
if driver_names:
    driver_name = driver_names[-1]  # you may need [-2] or another entry if the wrong driver is picked
if driver_name:
    conn_str = f'''DRIVER={driver_name};SERVER='''
else:
    print('(No suitable driver found. Cannot connect.)')
You can try the 'multi' method built into pandas to_sql.
df.to_sql('table_name', con=engine, if_exists='replace', index=False, method='multi')
The 'multi' method allows you to "pass multiple values in a single INSERT clause", per the documentation.
I found it to be pretty efficient.
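One caveat: SQL Server accepts at most 2100 parameters per statement, so with method='multi' the chunksize usually has to be capped accordingly. A sketch for the 27-column table from the question:
# 2100-parameter limit per statement -> at most 2100 // 27 = 77 rows per INSERT
df.to_sql('table_name', con=engine, if_exists='replace', index=False,
          method='multi', chunksize=2100 // len(df.columns))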
I am trying to get distributed queries to run on SQL Server 2012 with a linked server to Teradata.
The connection works fine and the query returns quickly if I pass the WHERE clause into the remote SQL using openquery, e.g.
select *
from openquery(td, 'select * from lib.purchases where z_PO = ''123456''')
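As an aside, when the filter value has to vary, the pushdown can be kept by building the remote text dynamically - a sketch, assuming RPC Out is enabled on the td linked server and that @po is sanitized upstream:
declare @po varchar(10) = '123456';
-- the filter travels to Teradata inside the pass-through batch
exec ('select * from lib.purchases where z_PO = ''' + @po + '''') at td;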
However, the below does not run as expected: SQL Server loads the entire table and performs a local filter:
select *
from openquery(td, 'select * from lib.purchases') where z_PO = '123456'
The source table has 100M records.
Obviously, indexes play no role here, as the query runs just fine on the TD side.
What I have tried:
sp_configure 'Ad Hoc Distributed Queries', 1
set "Collation Compatible" = True in the linked server properties
Instead of the above, set "Collation Name" = Latin1_BIN to match (closely?) the TD character set (ASCII).
Not sure collation is the issue, as I get the same result when filtering on numeric fields.
Somehow the so-called query optimizer in SQL Server does not push simple filtering down to the remote server.
Is this the ODBC driver's fault (using 16.10) - a setting, a bug? Or is there a SQL Server 2012 (v11.0.6248.0) setting I am missing (or a patch required)?
Below are the OLE DB for ODBC provider properties that I captured in SQL Profiler:
<ProviderInformation>
<Provider>MSDASQL</Provider>
<LinkedServer>td</LinkedServer>
<ProviderCapabilitiesAndSettings>
<Ansi92EntrySupport>0</Ansi92EntrySupport>
<ODBCCoreSupport>1</ODBCCoreSupport>
<ODBCMinimumSupport>1</ODBCMinimumSupport>
<SimpleGrammarSupport>0</SimpleGrammarSupport>
<AnsiLikeSupport>0</AnsiLikeSupport>
<SQLLikeSupport>1</SQLLikeSupport>
<DateLiteralsSupport>0</DateLiteralsSupport>
<GroupBySupport>0</GroupBySupport>
<InnerJoinSupport>0</InnerJoinSupport>
<SubqueriesSupport>0</SubqueriesSupport>
<SimpleUpdatesSupport>0</SimpleUpdatesSupport>
<HistogramsSupport>0</HistogramsSupport>
<ColumnLevelCollationSupport>0</ColumnLevelCollationSupport>
<ConnectionSharingSupport>0</ConnectionSharingSupport>
<MultipleActiveRowsetsSupport>0</MultipleActiveRowsetsSupport>
<MultipleResultsSupport>1</MultipleResultsSupport>
<AllowLimitingRowsReturned>1</AllowLimitingRowsReturned>
<NullConcatenationYieldsNull>0</NullConcatenationYieldsNull>
<StructuredStorageAccessToLargeObjects>1</StructuredStorageAccessToLargeObjects>
<MultipleConcurrentLargeObjectSupport>0</MultipleConcurrentLargeObjectSupport>
<DynamicParametersSupport>1</DynamicParametersSupport>
<NestedQueriesSupport>1</NestedQueriesSupport>
<IndicesAvailableAsAccessPath>0</IndicesAvailableAsAccessPath>
<AllowDataAccessByReference>1</AllowDataAccessByReference>
<RowsetChangesAreVisible>0</RowsetChangesAreVisible>
<RowsetSupportsAppendOnly>0</RowsetSupportsAppendOnly>
<UseLevelZeroOledbInterfacesOnly>0</UseLevelZeroOledbInterfacesOnly>
<RowsetUpdatability>1</RowsetUpdatability>
<AsynchronousRowsetProcessingSupport>0</AsynchronousRowsetProcessingSupport>
<DataSourceUnicodeLocaleId>0</DataSourceUnicodeLocaleId>
<DataSourceUnicodeComparisonStyle>0</DataSourceUnicodeComparisonStyle>
<DataSourceCollationComparisonFlags>0</DataSourceCollationComparisonFlags>
<DataSourceCharacterset></DataSourceCharacterset>
<DataSourceSortOrder></DataSourceSortOrder>
<DataSourceNullCollationOrder>4</DataSourceNullCollationOrder>
<CurrentDbCollationSameAsDefaultRemoteDbCollation>0</CurrentDbCollationSameAsDefaultRemoteDbCollation>
<UnicodeLiteralSupport>0</UnicodeLiteralSupport>
<UnicodeLiteralPrefix></UnicodeLiteralPrefix>
<UnicodeLiteralSuffix></UnicodeLiteralSuffix>
<DateLiteralPrefix></DateLiteralPrefix>
<DateLiteralSuffix></DateLiteralSuffix>
<ObjectNameConstructionFlags>54</ObjectNameConstructionFlags>
<SchemaSeparator>.</SchemaSeparator>
<CatalogSeparator>.</CatalogSeparator>
<QuoteSeparator>"</QuoteSeparator>
<BitRemoting>0</BitRemoting>
<UnicodeLiterals>0</UnicodeLiterals>
<ProviderOledbVersion>131072</ProviderOledbVersion>
<HalloweenProtectionNeeded>1</HalloweenProtectionNeeded>
<RowsetUsableAcrossThreads>0</RowsetUsableAcrossThreads>
<ObjectNameIsSinglePart>0</ObjectNameIsSinglePart>
<Cardinality>-1</Cardinality>
<BookmarkSupport>0</BookmarkSupport>
<BookmarksReusable>0</BookmarksReusable>
<TableFlags>0</TableFlags>
</ProviderCapabilitiesAndSettings>
and here are the details of the column used in the filter:
<DBCOLUMNINFO>
<pwszName>z_PO</pwszName>
<pTypeInfo>0x0000000000000000</pTypeInfo>
<iOrdinal>53</iOrdinal>
<dwFlags>120</dwFlags>
<ulColumnSize>10</ulColumnSize>
<wType>129</wType>
<bPrecision>255</bPrecision>
<bScale>255</bScale>
<DBID>
<eKind>DBKIND_NAME</eKind>
<uName.pwszName>z_PO</uName.pwszName>
</DBID>
</DBCOLUMNINFO>
As an FYI, the context is wrapping the openquery, joined with local data, into a SQL Server view, which is the only thing users would see - from there, they can apply any filter (WHERE) within Power Query (Excel) or Power BI. A way to circumvent the lack of DirectQuery support through ODBC.
Apologies if this has been answered elsewhere, but I could not find it. With the following code I am connecting to an MS SQL Server database via the RJDBC mechanism, using pooling from the pool package:
library(RJDBC)
library(DBI)
library(pool)
library(dplyr)
drv <-
JDBC(
"com.microsoft.sqlserver.jdbc.SQLServerDriver",
"C:/R/RJDBC/Microsoft JDBC Driver 6.0 for SQL Server/sqljdbc_6.0/enu/jre8/sqljdbc42.jar"
)
pool_instance <- dbPool(
drv = drv,
dbname = "dbasename",
url = "jdbc:sqlserver://sql01",
user = "user",
password = "password"
)
mydata <- dbGetQuery(pool_instance, "select * from my.Table")
src_pool(pool_instance) %>% tbl("my.Table") %>% head(5)
When I run this code, I make a successful connection to my SQL Server database and the dbGetQuery function call retrieves the data as expected.
However, when I call the src_pool function I get the following error message:
Error in UseMethod("tbl") : no applicable method for 'tbl' applied
to an object of class "c('src_', 'src_sql', 'src')"
If I call the function src_pool(pool_instance) separately, without piping to the tbl function, the error message is similar:
Error in UseMethod("src_desc") : no applicable method for
'src_desc' applied to an object of class "c('src_', 'src_sql', 'src')"
I expected that either dplyr or pool would provide these methods? Do I need to write code for these methods myself? What am I missing?
Note that I am a newbie to SQL Server database connectivity.
I am working off of a server housing various SQL databases (accessed via Microsoft SQL Server Management Studio) and am going to use R to perform analyses and explore a specific database within the server. I have network security that permits communication between machines, drivers installed on the R server, and RODBC installed.
When I attempt to establish a Windows ODBC connection in Control Panel > Administrative Tools > Data Sources, I can only add a data source for the entire SQL Server instance, not just for the specific database I want to look at. I pasted the code I have been experimenting with below.
library(RODBC)
channel <- odbcConnect("Example", uid="xxx", pwd="****")
sqlTables(channel)
sqlTables(channel, tableType = "TABLE")
res <- sqlFetch(channel, "samp.le", max = 15) # not recognized as a table
library(RODBC)
ch <- odbcDriverConnect('driver={"SQL Server"}; server=Example; database=dbasesample; uid="xxxx", pwd = "****"')
Response: Warning messages:
1: In odbcDriverConnect("driver={\"SQL Server\"}; server=sample; database=dbasesample; uid=\"xxxx", pwd = \"xxxx\"") :
[RODBC] ERROR: state IM002, code 0, message [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
2: In odbcDriverConnect("driver={\"SQL Server\"}; server=sample; database=dbasesample; uid=\"xxxx\", pwd = \"xxxx!\"") :
ODBC connection failed
Any insight into this issue would be much appreciated.
While querying with the sqlQuery() function you can specify database, schema and table, e.g.
library(RODBC)
con = odbcConnect(dsn = 'local')
sample_query = sqlQuery(con,'select * from db.dbo.table')
I have not found a way to define the database from within the function parameters while using sqlFetch() or sqlSave(). An indirect way would be to define the default database in the DSN (as noted in the comments). But then you would need a different DSN for each database you would like to use.
A better solution would be to use the odbc and DBI packages instead of RODBC and define the database in the connection statement, e.g.
library(dplyr)
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), dsn = 'local', database = 'db')
copy_to(con, rr2, temporary = F)  # rr2: a local data frame
By the way, I found copy_to to be much faster than the equivalent sqlSave of RODBC.
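Reading back works over the same odbc connection, either through DBI or lazily through dplyr - a sketch, reusing the hypothetical rr2 table written above:
DBI::dbGetQuery(con, "select top 10 * from rr2")
# or lazily, letting dplyr build the SQL:
tbl(con, 'rr2') %>% head(10) %>% collect()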
I am new to PowerBuilder.
I want to retrieve the data from MS Access tables and update the corresponding SQL Server tables. I am not able to create a permanent DSN for MS Access because I have to select different MS Access files with the same table structure. I can create a permanent DSN for SQL Server.
Please help me create a DSN dynamically when the MS Access file is selected and push all the table data to SQL Server using PowerBuilder.
Also, please give the full PowerBuilder code for the task if possible.
For Access we strongly suggest not using DSNs at all, as that is one less thing for someone to have to configure and one less thing for the users to screw up (see "Using DSN-Less Connections"). You should see if PowerBuilder has a similar option.
Create the DSN manually in the ODBC administrator
Locate the entry in the registry
Export the registry syntax into a .reg file
Read and edit the .reg file dynamically in PB
Write it back to the registry using PB's RegistrySet ( key, valuename, valuetype, value )
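A minimal PowerScript sketch of the last two steps, using a hypothetical DSN name and file paths (DBQ and Driver are the value names an Access DSN normally stores):
string ls_key
ls_key = "HKEY_CURRENT_USER\Software\ODBC\ODBC.INI\MyAccessDSN"
// point the DSN at the .mdb file the user selected
RegistrySet(ls_key, "DBQ", RegString!, "C:\selected_file.mdb")
RegistrySet(ls_key, "Driver", RegString!, "C:\Windows\System32\odbcjt32.dll")
// list the DSN so the ODBC manager can see it
RegistrySet("HKEY_CURRENT_USER\Software\ODBC\ODBC.INI\ODBC Data Sources", &
    "MyAccessDSN", RegString!, "Microsoft Access Driver (*.mdb)")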
Once you've got your DSN set up, there are many options to push data from one database to the other.
You'll need two transaction objects in PB, each pointing to its own database. Then, you could use a Data Pipeline object to manage the actual data transfer.
You want to do the DSN-less connection referenced by Tony. I show an example of doing it at PBDJ and have a code sample over at Sybase's CodeXchange.
I am using this code, try it!
//// Profile access databases accdb format
SQLCA.DBMS = "OLE DB"
SQLCA.AutoCommit = False
SQLCA.DBParm = "PROVIDER='Microsoft.ACE.OLEDB.12.0',DATASOURCE='C:\databasename.accdb',DelimitIdentifier='No',CommitOnDisconnect='No'"
Connect using SQLCA;
If SQLCA.SQLCode = 0 Then
    Open ( w_rsre_frame )
Else
    MessageBox ("Cannot Connect to Database", SQLCA.SQLErrText )
End If
or
//// Profile access databases mdb format
transaction aTrx
long resu
string database
database = "C:\databasename.mdb"
aTrx = create transaction
aTrx.DBMS = "OLE DB"
aTrx.AutoCommit = True
aTrx.DBParm = "PROVIDER='Microsoft.Jet.OLEDB.4.0',DATASOURCE='"+database+"',PBMaxBlobSize=100000,StaticBind='No',PBNoCatalog='YES'"
connect using aTrx ;
if atrx.sqldbcode = 0 then
    messagebox("", "Connection success to database")
else
    messagebox("Error code: " + string(atrx.sqlcode), atrx.sqlerrtext + " DB Code Error: " + string(atrx.sqldbcode))
end if
// do stuff...
destroy atrx