Set up an Airflow - MSSQL connection without using the Airflow UI - sql-server

I have a working Airflow - Docker - MSSQL connection (MSSQL is in the Airflow container, and I am able to communicate with MSSQL via the CLI).
I would like to create a DAG which is able to read from the database without using the Airflow UI.
With huge help from this answer (https://stackoverflow.com/a/47464524/20325057) I set the environment variable (AIRFLOW_CONN_MY_MSSQL: mssql+pyodbc://sa:Database2022@localhost,1433) in docker-compose.yaml like this:
...
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth'
AIRFLOW_CONN_MY_MSSQL: mssql+pyodbc://sa:Database2022@localhost,1433
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
...
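To confirm the container actually picks up this variable (a minimal sketch; the service name airflow-scheduler is an assumption based on the stock docker-compose file), the connection can be resolved from a Python shell inside the container:
# run inside the container, e.g.: docker compose exec airflow-scheduler python
# assumes AIRFLOW_CONN_MY_MSSQL is set for that service
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("my_mssql")
print(conn.conn_type, conn.host, conn.port)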
And created a simple DAG like that:
from airflow import DAG
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
from datetime import datetime
default_args = {
    'owner': 'airflow',
    'start_date': datetime.now()
}
with DAG("airflow_mssql_test3", schedule='@daily', default_args=default_args, catchup=False) as dag:
    read_mssql = MsSqlOperator(
        task_id='reading_mssql',
        mssql_conn_id='my_mssql',
        # read the first row of the table:
        sql=r"""SELECT TOP 1 * FROM stores;""",
    )
In PyCharm I get this single line error:
WARNING:root:OSError while attempting to symlink the latest log directory
I have Windows 10, but Airflow is running in a Docker container, so I do not know where the problem is.
At the same time in Airflow (2.4.1) I have this error:
from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator
ModuleNotFoundError: No module named 'airflow.providers.microsoft.mssql'
which is strange, as the provider packages are installed.
I am really stuck, and would appreciate any kind of help.
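One way to narrow down where the import fails (a sketch, assuming the service names from the stock docker-compose file) is to run the import check inside the container rather than in PyCharm on the host:
# run with: docker compose exec airflow-scheduler python check_provider.py
# the file name check_provider.py and the service name airflow-scheduler are illustrative assumptions
try:
    from airflow.providers.microsoft.mssql.operators.mssql import MsSqlOperator  # noqa: F401
    print("mssql provider is installed in this container")
except ModuleNotFoundError:
    print("mssql provider missing - apache-airflow-providers-microsoft-mssql is not in the image")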

Related

Querying MS Sql Server from a Jenkins Pipeline

I've been using Jenkins (2.289.3) in a docker container (https://hub.docker.com/r/jenkins/jenkins). The next update to Jenkins 2.312 migrates the docker container from Java 8 to Java 11.
I have some pipelines that use the SourceForge jTDS JDBC driver to query SQL Server (http://jtds.sourceforge.net/).
Example:
import java.sql.DriverManager
import groovy.sql.Sql
con = DriverManager.getConnection('jdbc:jtds:sqlserver://servername', 'user', 'password');
stmt = con.createStatement();
To make this work on Java 8, I ran this inside the Docker container:
cp jtds-1.3.1.jar ${JAVA_HOME}/jre/lib/ext
Which loads the jar for use inside Jenkins. This method no longer exists with Java 11.
It seems pipelines have added the @Grab syntax, e.g.
@Grab(group='net.sourceforge.jtds', module='jtds', version='1.3.1')
If I add this to my pipeline, I can see the jars are downloaded to /var/jenkins_home/.groovy/grapes/, but it doesn't seem to actually load the jar:
java.lang.ClassNotFoundException: net.sourceforge.jtds.jdbc.Driver
or
java.sql.SQLException: No suitable driver found for jdbc:jtds:sqlserver://servername
depending on which commands I run. Either way, it appears to be due to the jar not being loaded.
All the groovy examples use
@GrabConfig(systemClassLoader=true)
But this appears to not be supported in pipelines.
I've considered using a command-line client, but I need to parse the results of queries and I haven't seen a tool that works well for this (i.e., one that would load results into a JSON file or similar).
I've also tried setting the -classpath argument in the docker container, eg
ENV JAVA_OPTS=-classpath /var/jenkins_home/test/jtds-1.3.1.jar
Running ps in the docker container, I can see that the java process runs with the classpath command line option specified, but it doesn't seem to actually load the jar for use.
I'm a bit lost on how to get this working, can anyone help? Thanks.
Well, I've found a workaround. It doesn't seem ideal, but it does work.
The original code
import java.sql.DriverManager
import groovy.sql.Sql
con = DriverManager.getConnection('jdbc:jtds:sqlserver://servername', 'user', 'password');
stmt = con.createStatement();
Assuming we have the jar saved in /var/jenkins_home/test/jtds-1.3.1.jar it can be updated to:
import java.sql.DriverManager
import groovy.sql.Sql
def classLoader = this.class.classLoader
while (classLoader.parent) {
    classLoader = classLoader.parent
    if (classLoader.getClass() == java.net.URLClassLoader) {
        // load our jar into the URLClassLoader
        classLoader.addURL(new File("/var/jenkins_home/test/jtds-1.3.1.jar").toURI().toURL())
        break;
    }
}
// register the class
Class.forName("net.sourceforge.jtds.jdbc.Driver")
con = DriverManager.getConnection('jdbc:jtds:sqlserver://servername', 'user', 'password');
stmt = con.createStatement();
Once this code has been run once, the jar seems to be accessible globally (even in other pipelines that don't load the jar).
Based on this, it seems like a good way to handle this is on the Jenkins initialization, rather than in the script at all. I created /var/jenkins_home/init.groovy with these contents:
def classLoader = this.class.classLoader
while (classLoader.parent) {
    classLoader = classLoader.parent
    if (classLoader.getClass() == java.net.URLClassLoader) {
        classLoader.addURL(new File("/var/jenkins_home/jars/jtds-1.3.1.jar").toURI().toURL())
        break;
    }
}
Class.forName("net.sourceforge.jtds.jdbc.Driver")
And after that, the scripts behave similarly to how I think it should work with the jar on the classpath.

Airflow 2.0 SQL Server connector

I want to use the Airflow 2.0.1 docker-compose file from Apache's GitHub.
Here is the link to the docker-compose.yaml and here is the link to the Dockerfile.
I want to use a DAG which should grab data out of my SQL Server database. Currently I get the following error:
no module named pymssql
After I manually installed it, I get an error like no module named pyodbc.
When I try to install that manually, I get a gcc error saying it cannot be installed.
Does anyone have any clue about this?
Is there any docker-compose file which is able to handle SQL Server connection for Airflow 2?
Thanks in advance

import pyodbc, No module named 'pyodbc'

I am pretty new to all this so please bear with me.
I am trying to deploy my code on AWS Elastic Beanstalk, and my code uses the pyodbc package to fetch data from the database. The database is deployed on Microsoft Azure and is connected to the code. After deploying the code to Elastic Beanstalk it shows the error
import pyodbc: No module named 'pyodbc'
I have checked the requirements.txt file and it has the latest version of the pyodbc package. I updated all the versions of the packages I import. There are students who have done the same process (database on Azure, code on AWS Elastic Beanstalk) and it runs fine. My code runs perfectly fine on my local machine.
Any leads?
The pyodbc install may be failing, and thus it's not getting installed.
The reason is that on EB with Amazon Linux 2, you need gcc-c++ and unixODBC-devel as prerequisites for pyodbc.
Thus in your .ebextensions you can add a config file .ebextensions/10_packages.config with the content of:
packages:
  yum:
    gcc-c++: []
    unixODBC-devel: []

Trying to query Azure SQL Database with Azure ML / Docker Image

I wanted to do a real-time deployment of my model on Azure, so I plan to create an image which first queries an ID in an Azure SQL DB to get the required features, then predicts using my model and returns the predictions. The error I get from the pyodbc library is that the drivers are not installed.
I tried it in the Azure ML Jupyter notebook to establish the connection and found that no drivers are installed in the environment itself. After some research I found that I should create a Docker image and deploy it there, but I still got the same result.
driver= '{ODBC Driver 13 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password+';Encrypt=yes'+';TrustServerCertificate=no'+';Connection Timeout=30;')
('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 13 for SQL Server' : file not found (0) (SQLDriverConnect)")
I want a result from the query; instead I get this message.
Alternatively, you could use pymssql==2.1.1 if you add the following Docker steps in the deployment configuration (using either Environments or ContainerImages; Environments is preferred):
from azureml.core import Environment
from azureml.core.environment import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package('pymssql==2.1.1')
myenv = Environment(name="mssqlenv")
myenv.python.conda_dependencies=conda_dep
myenv.docker.enabled = True
myenv.docker.base_dockerfile = 'FROM mcr.microsoft.com/azureml/base:latest\nRUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc'
myenv.docker.base_image = None
Or, if you're using the ContainerImage class, you could add these Docker Steps
from azureml.core.image import Image, ContainerImage
image_config = ContainerImage.image_configuration(runtime= "python", execution_script="score.py", conda_file="myenv.yml", docker_file="Dockerfile.steps")
# Assuming this :
# RUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc
# is in a file called Dockerfile.steps, it should produce the same result.
See this answer for more details on how I've done it using an Estimator step and a custom Docker container. You could use this Dockerfile to locally create a Docker container for that Estimator step (no need to do that if you're just using an Estimator run outside of a pipeline):
FROM continuumio/miniconda3:4.4.10
RUN apt-get update && apt-get -y install freetds-dev freetds-bin gcc
RUN pip install Cython
For more details see this posting: using estimator in pipeline with custom docker images. Hope that helps!
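For completeness, a minimal sketch of how a scoring script could then query the database with pymssql (server, credentials and table name below are placeholders, not values from the original post):
import pymssql

# placeholder values - replace with your own server, credentials and table
conn = pymssql.connect(server='your-server.database.windows.net',
                       user='your-user',
                       password='your-password',
                       database='your-database')
cursor = conn.cursor()
cursor.execute('SELECT TOP 1 * FROM my_table')
print(cursor.fetchone())
conn.close()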
Per my experience, I think the comment @DavidBrowne-Microsoft made is right.
There is a similar SO thread, I am getting an error while connecting to an sql DB in Jupyter Notebook, answered by me, which I think will help you install the latest msodbcsql driver for Linux on an Azure Notebook or in Docker.
Meanwhile, there is a detail about the connection string for Azure SQL Database that you need to note carefully: you should use {ODBC Driver 17 for SQL Server} instead of {ODBC Driver 13 for SQL Server} if your Azure SQL Database was created recently (ignore the connection string shown in the Azure portal).
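A sketch of the adjusted connection string, mirroring the code from the question (server name, database and credentials are placeholders):
import pyodbc

# placeholder values - replace with your own Azure SQL server, database and credentials
server = 'your-server.database.windows.net'
database = 'your-database'
username = 'your-username'
password = 'your-password'

driver = '{ODBC Driver 17 for SQL Server}'  # instead of ODBC Driver 13
cnxn = pyodbc.connect('DRIVER=' + driver + ';SERVER=' + server + ';PORT=1433;DATABASE=' + database +
                      ';UID=' + username + ';PWD=' + password +
                      ';Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;')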
You can use the AzureML built-in Dataset solution to connect to your SQL server.
To do so, you can first create an azure_sql_database datastore (reference here).
Then create a dataset by passing in the datastore you created and the query you want to run (reference here).
Sample code:
from azureml.core import Dataset, Datastore, Workspace
workspace = Workspace.from_config()
sql_datastore = Datastore.register_azure_sql_database(
    workspace=workspace,
    datastore_name='sql_dstore',
    server_name='your SQL server name',
    database_name='your SQL database name',
    tenant_id='your directory ID/tenant ID of the service principal',
    client_id='the Client ID/Application ID of the service principal',
    client_secret='the secret of the service principal')
sql_dataset = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT * FROM my_table'))
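To actually pull the query results into memory, the usual follow-up (sketched here) is to convert the tabular dataset to a pandas DataFrame:
# materialize the query result as a pandas DataFrame
df = sql_dataset.to_pandas_dataframe()
print(df.head())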
You can also do it via the UI at ml.azure.com, where you can register an Azure SQL datastore using your username and password.

Airflow initdb command fails after linking with postgresql

I am trying to connect Airflow with a Postgresql DB.
In airflow.cfg I set sql_alchemy_conn = spostgresql+psycopg2://127.0.0.1:5432/airflow, where airflow is the name of my DB, which is installed on the same machine.
After updating the config file, I run airflow initdb and get the following error, which I cannot understand:
File "/some_path/env/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 232, in load
"Can't load plugin: %s:%s" % (self.group, name)
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:spostgresql.psycopg2
I found this on the web, which seems to "solve" this problem, but the solution was not clear to me at all.
Can someone tell me what the problem is and how to solve it?
Looks to me like you have a typo in your SQLAlchemy connection string (an extra s at the beginning of postgresql). Try changing:
spostgresql+psycopg2://127.0.0.1:5432/airflow
to
postgresql+psycopg2://127.0.0.1:5432/airflow
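As a quick way to validate a candidate URI before putting it into airflow.cfg (a sketch; the credentials below are placeholders, since the original string omits them), SQLAlchemy can be asked to resolve it directly:
from sqlalchemy import create_engine

# create_engine() resolves the dialect immediately, so the stray 's' in
# 'spostgresql+psycopg2' would raise NoSuchModuleError right here
uri = "postgresql+psycopg2://airflow_user:airflow_pass@127.0.0.1:5432/airflow"
engine = create_engine(uri)

# optionally also verify the database itself is reachable
with engine.connect():
    pass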
