JavaKerberos authentication to SQL Server on Spark framework

I am trying to get a Spark cluster to write to SQL Server using JavaKerberos with Microsoft's JDBC driver (v7.0.0), i.e. I specify integratedSecurity=true;authenticationScheme=JavaKerberos in the connection string, with credentials supplied in a keytab file, and I am not having much success (the problem is the same if I specify credentials in the connection string).
I am submitting the job to the cluster (4-node YARN cluster, Spark v2.3.0) with:
spark-submit --driver-class-path mssql-jdbc-7.0.0.jre8.jar \
--jars /path/to/mssql-jdbc-7.0.0.jre8.jar \
--conf spark.executor.extraClassPath=/path/to/mssql-jdbc-7.0.0.jre8.jar \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf" \
application.jar
Things work partially: the Spark driver authenticates correctly and creates the table; however, when any of the executors come to write to the table, they fail with an exception:
java.security.PrivilegedActionException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
Observations:
I can get everything to work if I specify SQL Server credentials (however, I need to use integrated security in my application)
The keytab and login module file "SQLJDBCDriver.conf" seem to be specified correctly, since they work for the driver
I can see in the Spark UI that the executors pick up the correct command-line options:
-Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf
After a lot of logging/debugging of the difference in driver and executor behaviour, it seems to come down to the executor trying to use the wrong credentials, even though the specified options should make it use those in the keytab file, as the driver does successfully. (That is why it generates this particular exception, which is what it does if I deliberately supply incorrect credentials.)
Strangely, I can see in the debug output that the JDBC driver finds and reads the SQLJDBCDriver.conf file, and the keytab has to be present (otherwise I get a file-not-found failure), yet it then promptly ignores them and tries to use default behaviour/local user credentials.
Can anyone help me understand how I can force the executors to use credentials provided in a keytab or otherwise get JavaKerberos/SQL Server authentication to work with Spark?
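For reference, the login module file is a standard JAAS configuration. A sketch of the kind of thing SQLJDBCDriver.conf contains (the entry name SQLJDBCDriver is the MS driver's default JAAS configuration name; keytab path and principal are placeholders):
SQLJDBCDriver {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/path/to/user.keytab"
    principal="user@EXAMPLE.COM"
    doNotPrompt=true;
};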

Just to give an update on this: I've just closed https://issues.apache.org/jira/browse/SPARK-12312, and it is now possible to do Kerberos authentication with the built-in JDBC connection providers. Several providers have been added, and one of them is for MS SQL. Please read the documentation on how to use it: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Please be aware that Spark 3.1 is not yet released, so it will take some time before the two newly added configuration parameters (keytab and principal) appear on that page. I think the doc update will happen within 1-2 weeks.
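Once 3.1 is available, a minimal Scala sketch of the built-in provider usage would look like the following (connection-string flags copied from the question above; server, table, principal, and keytab values are placeholders):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<SERVER_NAME>:1433;databaseName=<DATABASE_NAME>;integratedSecurity=true;authenticationScheme=JavaKerberos")
  .option("dbtable", "dbo.table_name")
  .option("principal", "user@EXAMPLE.COM")   // one of the two new parameters
  .option("keytab", "/path/to/user.keytab")  // the other new parameter
  .load()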

Integrated authentication does not work with the MS SQL Server JDBC driver in a secure cluster with AD integration, because the containers do not have the required context: the Kerberos tokens are lost when the mappers spawn (as YARN transitions the job to its internal security subsystem).
Here is my repo that was used as a workaround to get Kerberos/AD authentication: https://github.com/chandanbalu/mssql-jdbc-krb5. The solution implements a driver that overrides the connect method of the latest MS SQL JDBC driver (mssql-jdbc-9.2.1.jre8.jar), gets a ticket for the keytab file/principal, and gives this connection back.
You can grab the latest build of this custom driver from the release folder here.
Start spark-shell with the JARs:
spark-shell --jars /efs/home/c795701/.ivy2/jars/mssql-jdbc-9.2.1.jre8.jar,/efs/home/c795701/mssql-jdbc-krb5/target/scala-2.10/mssql-jdbc-krb5_2.10-1.0.jar
Scala
scala> val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:krb5ss://<SERVER_NAME>:1433;databasename=<DATABASE_NAME>;integratedSecurity=true;authenticationScheme=JavaKerberos;krb5Principal=c795701@NA.DOMAIN.COM;krb5Keytab=/efs/home/c795701/c795701.keytab")
  .option("driver", "hadoop.sqlserver.jdbc.krb5.SQLServerDriver")
  .option("dbtable", "dbo.table_name")
  .load()
scala>jdbcDF.count()
scala>jdbcDF.show(10)
spark-submit command
com.spark.SparkJDBCIngestion - Spark JDBC data frame operations
ingestionframework-1.0-SNAPSHOT.jar - Your project build JAR
spark-submit \
--master yarn \
--deploy-mode cluster \
--jars "/efs/home/c795701/mssql-jdbc-krb5/target/scala-2.10/mssql-jdbc-krb5_2.10-1.0.jar,/efs/home/c795701/.ivy2/jars/scala-library-2.11.1.jar" \
--files /efs/home/c795701/c795701.keytab \
--class com.spark.SparkJDBCIngestion \
/efs/home/c795701/ingestionframework/target/ingestionframework-1.0-SNAPSHOT.jar
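Note that files shipped with --files are localized into each YARN container's working directory, so inside the application the keytab can be referenced by its bare file name. A Scala sketch (driver class and URL as in the spark-shell example above):
val url = "jdbc:krb5ss://<SERVER_NAME>:1433;databasename=<DATABASE_NAME>;" +
  "integratedSecurity=true;authenticationScheme=JavaKerberos;" +
  "krb5Principal=c795701@NA.DOMAIN.COM;" +
  "krb5Keytab=c795701.keytab" // bare name: resolves in the container working dir
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("driver", "hadoop.sqlserver.jdbc.krb5.SQLServerDriver")
  .option("dbtable", "dbo.table_name")
  .load()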

So apparently JDBC Kerberos authentication is just not possible on the executors currently, according to an old JIRA: https://issues.apache.org/jira/browse/SPARK-12312. The behaviour is the same as of version 2.3.2, according to the Spark user list and my own testing.
Workarounds
Use kinit and then distribute the cached TGT to the executors, as detailed here: https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Executors_Kerberos_HowTo.md. I think this technique only works for the user the Spark executors run under; at least I couldn't get it to work for my use case.
Wrap the JDBC driver with a custom version that deals with the authentication and then calls and returns a connection from the real MS JDBC driver, as sketched below. Details here: https://datamountaineer.com/2016/01/15/spark-jdbc-sql-server-kerberos/ and the associated repo here: https://github.com/nabacg/krb5sqljdb. I got this technique to work, though I had to modify the authentication code for my case.
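For illustration, here is a minimal Scala sketch of that second technique (all names here are illustrative, not taken from the linked repo): a thin java.sql.Driver that performs a JAAS keytab login on the executor and opens the real Microsoft driver's connection inside Subject.doAs:
import java.sql.{Connection, Driver, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger
import java.security.PrivilegedExceptionAction
import javax.security.auth.Subject
import javax.security.auth.login.LoginContext

class KerberosWrappingDriver extends Driver {
  // The real Microsoft driver that does the actual work.
  private val real = new com.microsoft.sqlserver.jdbc.SQLServerDriver()

  override def connect(url: String, info: Properties): Connection = {
    if (!acceptsURL(url)) return null
    // "KrbLogin" must match an entry in the JAAS file given via
    // -Djava.security.auth.login.config; keytab and principal live there.
    val lc = new LoginContext("KrbLogin")
    lc.login() // obtains the TGT from the keytab inside this executor JVM
    Subject.doAs(lc.getSubject, new PrivilegedExceptionAction[Connection] {
      // Rewrite our marker scheme back to the real one before delegating.
      override def run(): Connection =
        real.connect(url.replaceFirst("jdbc:krbwrap:", "jdbc:sqlserver:"), info)
    })
  }

  override def acceptsURL(url: String): Boolean = url.startsWith("jdbc:krbwrap:")
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] = Array.empty
  override def getMajorVersion: Int = 1
  override def getMinorVersion: Int = 0
  override def jdbcCompliant(): Boolean = false
  override def getParentLogger: Logger = Logger.getGlobal
}
You would register the wrapper with DriverManager.registerDriver(new KerberosWrappingDriver()) (or a META-INF/services entry) and point the JDBC driver option at it.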

As Gabor Somogyi said, you need to use Spark > 3.1.0 and the keytab and principal arguments. I have 3.1.1.
Put the keytab at the same path on ALL hosts and machines where you use your code, and keep the keytab up to date.
Add integratedSecurity=true;authenticationScheme=JavaKerberos; to the connection string.
The reading block will look like:
jdbcDF = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", url)
    .option("dbtable", table_name)
    .option("principal", "username@domain")
    .option("keytab", "/same/path/on/all/hosts/user.keytab")
    .load()
)
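Since the original question is about writing, note that the same two options apply on the write path; a Scala sketch with the same placeholder values (whether the MS Spark connector honours them identically on write is an assumption here):
jdbcDF.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .option("url", url)
  .option("dbtable", "dbo.table_name_out") // hypothetical target table
  .option("principal", "username@domain")
  .option("keytab", "/same/path/on/all/hosts/user.keytab")
  .mode("append")
  .save()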

Related

Connecting Jmeter to an Oracle database with two hosts and service name

I am trying to connect JMeter to a geo-redundant database with two hosts, and I am struggling to find the right Database URL format.
This is what my connection string looks like:
jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(FAILOVER=on)(CONNECT_TIMEOUT=5sec)(TRANSPORT_CONNECT_TIMEOUT=3sec)(RETRY_COUNT=3)(LOAD_BALANCE=on)(ADDRESS_LIST=(LOAD_BALANCE=on)(ADDRESS=(PROTOCOL=TCP)(HOST=HostName)(PORT=port)))(ADDRESS_LIST=(LOAD_BALANCE=on)(ADDRESS=(PROTOCOL=TCP)(HOST=HostName2)(PORT=port)))(CONNECT_DATA=(SERVICE_NAME=ServiceName)))
The Database Connection Configuration is as follows:
JDBC Driver Class: oracle.jdbc.OracleDriver
Username: username
Password: password
For the Database URL I tried different formats and I keep getting the error:
Cannot load JDBC driver class 'oracle.jdbc.OracleDriver'
Note that the ojdbc.jar file is in the /lib folder as per the JMeter documentation. Also, the ports are the same for both hosts.
Any suggestion is welcome. :)
I don't think you will be able to establish the connection to Oracle RAC using JMeter's JDBC Connection Configuration, as it doesn't allow full flexibility; you will not be able to properly instantiate the PoolDataSourceFactory.
So I would recommend switching to JSR223 Test Elements and the Groovy language, where you will have full freedom when it comes to setting up the connection, executing queries, accessing results, etc. The relevant code would be something like:
import oracle.ucp.jdbc.PoolDataSource
import oracle.ucp.jdbc.PoolDataSourceFactory

def prop = new Properties()
prop.put('oracle.jdbc.thinForceDNSLoadBalancing', 'true')
PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource()
pds.setConnectionProperties(prop)
pds.setConnectionFactoryClassName('oracle.jdbc.pool.OracleDataSource')
pds.setUser('johndoe')
pds.setPassword('secret')
String dbURL =
    'jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(FAILOVER=on)(CONNECT_TIMEOUT=5sec)' +
    '(TRANSPORT_CONNECT_TIMEOUT=3sec)(RETRY_COUNT=3)(LOAD_BALANCE=on)(ADDRESS_LIST=(LOAD_BALANCE=on)' +
    '(ADDRESS=(PROTOCOL=TCP)(HOST=HostName)(PORT=port)))(ADDRESS_LIST=(LOAD_BALANCE=on)' +
    '(ADDRESS=(PROTOCOL=TCP)(HOST=HostName2)(PORT=port)))(CONNECT_DATA=(SERVICE_NAME=ServiceName)))'
pds.setURL(dbURL)
More information: Configuring Fast Connection Failover for JDBC Clients
It appears to be working with a connection string containing only host 1.
The Database URL is in the form:
jdbc:oracle:thin:#<hostname>:<port>/<serviceName>
Additionally, I got the error because the .jar file's path was not added to the classpath (click on Test Plan, at the bottom select Browse next to "Add directory or jar to classpath", and select your ojdbc jar).
Another thing that was wrong was the validation query: it should be "select 1 from dual", and the query should not have a semicolon at the end.
I hope this helps people with the same issue.

Google Data Studio MySql data source connection does not exist Error

Platform: Google Data Studio
Data Source: MySQL
The connection was working before, meaning there are no issues with credentials. All of a sudden, I am getting the error from the title: the MySQL data source connection does not exist. All IPs from the Google Data Studio list of IPs have been whitelisted.
The only thing that comes to mind is a limitation of GDS in processing the data: the data source table has around 200K+ rows, and I am not sure what the limitation is for GDS with MySQL; there's no indication anywhere.
Anyone out there who can help solve this or provide some info? It would be appreciated. Thanks
If you use a firewall, be sure to double-check the Google IP addresses. They may have added new IPs (in my case, the last one was missing). Check them here!
After doing so, I had to change the host name of the database connection to a URL alias (www.yourserver.com, i.e. a URL pointing at your server), and then change it back to the IP to make it work.
Sounds like the connector cannot establish a new connection.
Cloud SQL Connector:
At the time of writing, the connector seems unable to establish a new connection once the existing one has timed out, and modifying the JDBC URL to include query parameters gives you an error when authenticating. This is probably due to the connector appending its own parameters. (There seems to be a possible bug here when a connection no longer exists.)
MySQL Connector (with IP address):
This connector allows you to add query parameters to the JDBC URL. Enable SSL and append useSSL=true to the URL, e.g.:
jdbc:mysql://<ip>/<database>?useSSL=true
This worked as expected and establishes new connections when required.
Example Source Setup
Suffering from this issue too, my experience is that using the MySQL connector instead of the Cloud SQL connector provides better stability, in combination with setting wait_timeout to a value above 12 hours.
This issue has been reported on the official Google Data Studio bug tracker; please vote the reports up if you are also suffering from this:
🐛 130205306 MySQL connection does not exist Apr 9, 2019 04:36PM
🐛 118470083 Data source password not stored for MySQL sources. Oct 26, 2018 01:24PM

Setting up samba 4 AD with an LDAP backend

Case:
For a couple of months now I've been following various tutorials, documentation, and examples, but somehow my end result never works like in any of the tutorials.
What I need to do is set up an Active Directory using Samba 4.0 on an Ubuntu Server 16.04 LTS. Samba should use an LDAP backend that is running on another Ubuntu Server 16.04 LTS. Windows clients will use the LAN to log in to the domain with LDAP accounts.
A bonus would be to have a master-master connection from that LDAP server to another LDAP server, but since I have already succeeded in doing something similar, I will focus on the problem of setting up Samba with an LDAP backend.
I'm getting pretty frustrated, since even though I follow tutorials and read a lot about the subject, it somehow never ends in a result where I can actually log in to the domain, be it with a Samba account or an LDAP one. The closest I got was that at some point I was able to log in with a Unix account, but with no Active Directory services at that time.
Documentation that I followed:
https://help.ubuntu.com/lts/serverguide/samba-ldap.html
https://wiki.samba.org/index.php/Samba,_Active_Directory_%26_LDAP
https://help.ubuntu.com/lts/serverguide/samba-dc.html
https://www.techrepublic.com/article/how-to-configure-ubuntu-linux-server-as-a-domain-controller-with-samba-tool/
Steps performed:
Used servers:
- cloud.smoothalicious.info
- router.smoothalicious.info
- monfig.smoothalicious.info
In this order:
Installed LDAP on both cloud and router, after which I implemented replication successfully: cloud is the master (producer) and router is the slave (consumer). After this I imported the Samba schema and added the Samba indices on the master LDAP (cloud). Although replication was successful before, it failed with the Samba indices, without any error messages in syslog, auth.log, or any LDAP logs. I added the indices manually, giving up on replication at that point.
On monfig I installed Samba 4.0 and used the Samba provision tool to configure it. Although I could finally find the Active Directory through a Windows 10 client, I could not log in to it with a Samba user account that I had added to the domain.
The above steps are from my previous setup; the new one follows.
Since this obviously was a big bust, I decided to start over with a new tutorial, this time just setting up a Samba AD with an LDAP backend (source: https://www.unixmen.com/setup-samba-domain-controller-with-openldap-backend-in-ubuntu-13-04/). This time I got as far as populating the LDAP tree with smbldap-populate, which was successful. Unfortunately, I was not able to find those groups with getent group. The error I get is:
nss_ldap: failed to bind to LDAP server ldapi:///cloud.smoothalicious.info: Can't contact LDAP server
Side note:
I don't seek answers, although they are welcome. I seek a tutorial that I can follow that does not end with me having different results than the tutorial shows, even though I followed it to the letter <- this is frustrating, and it happens a lot.
An LDAP backend for Samba 4 is not supported:
https://wiki.samba.org/index.php/FAQ#Do_Samba_AD_DCs_Support_OpenLDAP_or_Other_LDAP_Servers_as_the_Back_End.3F
There's some work being done on it, but it's far from being ready for production.
A lot of people are asking for it, but it seems that the Samba devs adopted a make-all-other-systems-accommodate-to-me approach.

java.sql.SQLException: [tibcosoftwareinc][Oracle JDBC Driver][Oracle]ORA-28040: No matching authentication protocol

I get the above error while trying to connect to Oracle 12c. I tried using the ojdbc6 and ojdbc7 jar files. I found the comment below:
------------------->
Bug 14575666
In 12.1, the default value for the SQLNET.ALLOWED_LOGON_VERSION parameter has been updated to 11. This means that database clients using pre-11g JDBC thin drivers cannot authenticate to 12.1 database servers unless the SQLNET.ALLOWED_LOGON_VERSION parameter is set to the old default of 8.
This will cause a 10.2.0.5 Oracle RAC database creation using DBCA to fail with the ORA-28040: No matching authentication protocol error in 12.1 Oracle ASM and Oracle Grid Infrastructure environments.
Workaround: Set SQLNET.ALLOWED_LOGON_VERSION=8 in the oracle/network/admin/sqlnet.ora file.
<-------------------
I have a doubt about implementing the above workaround, as we have a shared database.
If I set SQLNET.ALLOWED_LOGON_VERSION=8 in the oracle/network/admin/sqlnet.ora file, will it affect other users?
Will it affect shared applications and their functionality?
Setting SQLNET.ALLOWED_LOGON_VERSION=8 in sqlnet.ora affects all connections to the server. You're allowing user authentication with older versions of the password verifier and it affects all users. You can't allow it for just one user. But this isn't going to break other applications that can already connect successfully. It will allow older applications (that use old drivers) to connect too. The best solution is to upgrade all clients if possible but this setting is the workaround and it was made available for this exact purpose.
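For reference, the workaround is a one-line, server-side change (note that 12c also offers the split variants SQLNET.ALLOWED_LOGON_VERSION_SERVER and SQLNET.ALLOWED_LOGON_VERSION_CLIENT; the server-side setting is the one that matters here):
# oracle/network/admin/sqlnet.ora on the shared database server
# This affects every incoming connection, not a single user.
SQLNET.ALLOWED_LOGON_VERSION=8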

Connecting to Google Cloud SQL from Eclipse Not Using App Engine

We are trying to connect to Google Cloud SQL from Eclipse using the Database Development perspective. To do so I'm trying to add a new Database Connection, which I was able to do successfully for a local MySQL instance running on my machine.
The motivation for doing this is that we currently run our JUnit tests against the local instance. However, we are switching to Hibernate and want to make sure that all of our configuration files work with Cloud SQL. As a general guide I've been using:
https://developers.google.com/appengine/articles/using_hibernate
We're diverging slightly in that we're using hibernate.cfg.xml instead of persistence.xml, but I don't think this will actually have a bearing on the current issue of simply connecting to the database. From another answer as well as some Google documentation I'm aware that I can't use the com.google.appengine.api.rdbms.AppEngineDriver, because that needs to be run from an AppEngine instance. Instead I'm trying to follow the directions here:
https://developers.google.com/cloud-sql/docs/external
and am using com.mysql.jdbc.Driver.
I have assigned my Cloud SQL instance an IP address and have added my current IP address to the whitelist, as described here:
https://developers.google.com/cloud-sql/docs/access-control#appaccess
My driver is the Connector/J driver I've been using successfully with the local instance, and the url I'm using is:
jdbc:google:rdbms://my-app:my-cloud-sql-instance/myDatabase
which I got based on:
https://developers.google.com/appengine/articles/using_hibernate
After adding the connection and setting the information I click Test Connection, which worked successfully on my local instance. However, this throws the following error:
java.lang.Exception: Connection failed with unspecified error.
at org.eclipse.datatools.connectivity.DriverConnectionBase.internalCreateConnection(DriverConnectionBase.java:110)
at org.eclipse.datatools.connectivity.DriverConnectionBase.open(DriverConnectionBase.java:54)
at org.eclipse.datatools.connectivity.drivers.jdbc.JDBCConnection.open(JDBCConnection.java:73)
at org.eclipse.datatools.enablement.internal.mysql.connection.JDBCMySQLConnectionFactory.createConnection(JDBCMySQLConnectionFactory.java:28)
at org.eclipse.datatools.connectivity.internal.ConnectionFactoryProvider.createConnection(ConnectionFactoryProvider.java:83)
at org.eclipse.datatools.connectivity.internal.ConnectionProfile.createConnection(ConnectionProfile.java:359)
at org.eclipse.datatools.connectivity.ui.PingJob.createTestConnection(PingJob.java:76)
at org.eclipse.datatools.connectivity.ui.PingJob.run(PingJob.java:59)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:53)
Obviously this isn't very helpful.
I've tried fiddling with the URL, tried a number of users (none of which require passwords, so I'm leaving the password fields blank), and tried different versions of the driver for different versions of MySQL. Nothing has worked.
There are perhaps more deep-seated issues with doing it this way, such as how I will easily switch between test and deployment versions of my hibernate.cfg.xml, for which I don't have good answers. I was just planning on editing them by hand back to the AppEngineDriver, which means I might run into further configuration issues at that point even if the JUnit tests are passing. Nevertheless, I think getting a connection set up to Cloud SQL that will allow JUnit testing is a step in the right direction. I'd appreciate any input!
You should use jdbc:mysql://<cloudsql-instance-ip>:3306/<database-name> to connect from an external network. The connection string you are using is for connecting from Google App Engine.
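If you want a quick sanity check outside Eclipse, a minimal Scala program using Connector/J would look like this (a sketch; IP, database, and credentials are placeholders):
import java.sql.DriverManager

object CloudSqlPing extends App {
  Class.forName("com.mysql.jdbc.Driver") // Connector/J driver class
  // Use the Cloud SQL instance's assigned IP, not the jdbc:google:rdbms:// form.
  val conn = DriverManager.getConnection(
    "jdbc:mysql://203.0.113.10:3306/myDatabase", "myUser", "myPassword")
  try {
    val rs = conn.createStatement().executeQuery("SELECT 1")
    rs.next()
    println(s"Connected, SELECT 1 returned ${rs.getInt(1)}")
  } finally conn.close()
}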
