I am trying to use CROSS APPLY in an Azure Databricks notebook and it throws the error "incorrect syntax near apply", while the same query works fine in SQL Server. Does Azure Databricks not support the APPLY operators?
## Connection parameters ##
sqlserver = 'servername'
port = '1433'
database = 'dbname'
user = 'username'
pswd = "pwd"
query = "(select * from repair_detail a\
cross apply (select top 1 * from order c where a.RO_NO=c.RO_NO) b\
) AS CustSales"
## Load Data Frame ##
df1 = spark.read \
    .option('user', user) \
    .option('password', pswd) \
    .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database,
          query)
# Show the resulting DataFrame
df1.show()
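For reference, here is a minimal sketch of the same pushdown written with a triple-quoted string and the JDBC query option (the server, credentials, table and column names are taken from the post above and are assumptions, not verified). Note that the backslash line continuations inside the single-line string above silently join the alias a and the keyword cross into "across", which by itself can produce a syntax error near apply on the server:
# Sketch only: push the T-SQL (including CROSS APPLY) down to SQL Server via the
# JDBC "query" option, so Spark itself never has to parse the APPLY syntax.
pushdown_query = """
select *
from repair_detail a
cross apply (select top 1 * from order c where a.RO_NO = c.RO_NO) b
"""
df1 = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://" + sqlserver + ":" + port + ";database=" + database)
    .option("query", pushdown_query)
    .option("user", user)
    .option("password", pswd)
    .load())
df1.show()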
Related
I am playing around with Spark and want to store a data frame in a SQL database. It works, but not when saving a datetime column:
from pyspark.sql import SparkSession,Row
from pyspark.sql.types import IntegerType,TimestampType,StructType,StructField,StringType
from datetime import datetime
...
spark = SparkSession.builder \
...
.getOrCreate()
# Create DataFrame
rdd = spark.sparkContext.parallelize([
    Row(id=1, title='string1', created_at=datetime.now())
])
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])
df = spark.createDataFrame(rdd, schema)
df.show()
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Schema: id (integer, not nullable), title (string, not nullable), created_at (timestamp, nullable)
Error:
com.microsoft.sqlserver.jdbc.SQLServerException: Received an invalid column length from the bcp client for colid 3
From my understanding, the error states that datetime.now() produced an invalid column length. But how can that be if it is a standard datetime? Any ideas what the issue is?
There are problems with the code to create the dataframe. You are missing libraries. The code below creates the dataframe correctly.
#
# 1 - Make test dataframe
#
# libraries
from pyspark.sql import Row
from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
# create rdd
rdd = spark.sparkContext.parallelize([Row(id=1, title='string1', created_at=datetime.now())])
# define structure
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])
# create df
df = spark.createDataFrame(rdd, schema)
# show df
display(df)
The output of display(df) is shown above. Next, we need to create a table whose nullability and data types match the dataframe.
The code below creates a table called stack_overflow.
-- drop table
drop table if exists stack_overflow
go
-- create table
create table stack_overflow
(
    id int not null,
    title varchar(100) not null,
    created_at datetime2 null
)
go
-- show data
select * from stack_overflow
go
Next, we need to define our connection properties.
#
# 2 - Set connection properties
#
server_name = "jdbc:sqlserver://svr4tips2030.database.windows.net"
database_name = "dbs4tips2020"
url = server_name + ";" + "databaseName=" + database_name + ";"
user_name = "jminer"
password = "<your password here>"
table_name = "stack_overflow"
Last, we want to execute the code to write the dataframe.
#
# 3 - Write test dataframe
#
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", user_name) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Executing a select query shows that the data was written correctly.
In short, look at the documentation for Spark SQL Types. I found out that datetime2 works nicely.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.types?view=spark-dotnet
One word of caution: this code does not handle datetimeoffset values. There is also no Spark data type that maps to an offset-aware timestamp.
# Sample date time offset value
import pytz
from datetime import datetime, timezone, timedelta
user_timezone_setting = 'US/Pacific'
user_timezone = pytz.timezone(user_timezone_setting)
the_event = datetime.now()
localized_event = user_timezone.localize(the_event)
print(localized_event)
The code above prints the localized timestamp together with its UTC offset.
But once we put it into a dataframe, it loses the offset, since the value is converted to UTC. If the UTC offset is important, you will have to pass that information as a separate integer.
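A hedged sketch of carrying that offset as a separate integer column (this is an assumption about how to work around the limitation, not something the connector provides):
# Sketch: store the instant in a TimestampType column (Spark keeps it internally as UTC)
# and carry the UTC offset in minutes as a plain integer column alongside it.
import pytz
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType

user_timezone = pytz.timezone('US/Pacific')
localized_event = user_timezone.localize(datetime.now())
offset_minutes = int(localized_event.utcoffset().total_seconds() // 60)  # e.g. -480 or -420

offset_schema = StructType([
    StructField("created_at", TimestampType(), True),
    StructField("utc_offset_minutes", IntegerType(), True)
])
df_offset = spark.createDataFrame(
    [Row(created_at=localized_event, utc_offset_minutes=offset_minutes)], offset_schema)
display(df_offset)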
SQL Server's datetime data type has a time precision of roughly 3 milliseconds (values run from 00:00:00 through 23:59:59.997). The output of datetime.now() carries microsecond precision, so it will not fit into datetime; you need to change the data type of the column in the SQL Server table to datetime2.
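If the table's data type cannot be changed, a hedged alternative (an assumption, not part of either answer above) is to truncate the timestamp to whole seconds in Spark before writing, so the value fits within datetime's precision:
# Sketch: drop the sub-second part of created_at before the JDBC write so a plain
# datetime column can hold it; date_trunc is a standard pyspark.sql function.
from pyspark.sql import functions as F

df_truncated = df.withColumn("created_at", F.date_trunc("second", F.col("created_at")))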
I am trying to upload a binary.zip to SQL Server as varbinary type column content.
Target Table:
CREATE TABLE myTable ( zipFile varbinary(MAX) );
My NiFi flow is very simple:
-> GetFile:
filter:binary.zip
-> UpdateAttribute:
sql.args.1.type = -3 # varbinary, according to the JDBC types enumeration
sql.args.1.value = ??? # I don't know what to put here! (I've been trying everything!)
sql.args.1.format = ??? # Is it required? I tried 'hex'
-> PutSQL:
SQLstatement = INSERT INTO myTable (zip_file) VALUES (?);
What should I put in sql.args.1.value?
I think it should be the flowfile payload, but would that work as part of the INSERT in PutSQL? So far it hasn't!
Thanks!
SOLUTION UPDATE:
Based on https://issues.apache.org/jira/browse/NIFI-8052
(Note that I'm sending some data as attribute parameters.)
import java.nio.charset.StandardCharsets
import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql
def flowFile = session.get()
if (!flowFile) return

def lookup = context.controllerServiceLookup
def dbServiceName = flowFile.getAttribute('DatabaseConnectionPoolName')
def tableName = flowFile.getAttribute('table_name')
def fieldName = flowFile.getAttribute('field_name')
// look up the DBCP controller service by name and borrow a connection from it
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find { cs ->
    lookup.getControllerServiceName(cs) == dbServiceName
}
def conn = lookup.getControllerService(dbcpServiceId)?.getConnection()
def sql = new Sql(conn)
// stream the flow file content straight into the varbinary parameter
flowFile.read { rawIn ->
    def parms = [rawIn]
    sql.executeInsert "INSERT INTO " + tableName + " (date, " + fieldName + ") VALUES (CAST(GETDATE() AS Date), ?)", parms
}
conn?.close()
session.transfer(flowFile, REL_SUCCESS)
session.commit()
Maybe there is a NiFi-native way to insert a blob; however, you could use ExecuteGroovyScript instead of UpdateAttribute and PutSQL.
Add an SQL.mydb parameter at the processor level and link it to the required DBCP pool.
Use the following script body:
def ff = session.get()
if (!ff) return

def statement = "INSERT INTO myTable (zip_file) VALUES (:p_zip_file)"
def params = [
    p_zip_file: SQL.mydb.BLOB(ff.read())  // cast the flow file content as a BLOB sql type
]
SQL.mydb.executeInsert(params, statement)  // committed automatically on flow file success
// transfer to success without changes
REL_SUCCESS << ff
Inside the script, SQL.mydb is a reference to a groovy.sql.Sql object.
The code below tries to connect to an MSSQL database using pymssql. I have a CSV file and am trying to push all of its rows into a single table in the database. I get a KeyError when I try to execute the code after opening the CSV file.
import csv
import pymssql
conn = pymssql.connect(host="host name",
                       database="dbname",
                       user="username",
                       password="password")
cursor = conn.cursor()
if conn:
    print("True")
else:
    print("False")
with open('path to csv file', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)
    query = "INSERT INTO Marketing({'URL', 'Domain_name', 'Downloadables', 'Text_without_javascript', 'Downloadable_Link'}) VALUES ({%s,%s,%s,%s,%s})"
    query = query.format(','.join('[' + x + ']' for x in columns), ','.join('?' * len(columns)))
    cursor = conn.cursor()
    for data in reader:
        cursor.execute(query, tuple(data))
    cursor.commit()
Below is the error that I get:
KeyError: "'URL', 'Domain_name', 'Downloadables', 'Text_without_javascript', 'Downloadable_Link'"
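For what it's worth, the KeyError is raised by str.format() itself, before any SQL reaches the server: everything between { and } in the template is treated as a replacement-field name, so the quoted column list becomes a key that was never passed to format(). A minimal, standalone reproduction (no database needed; the shortened column list is just for illustration):
# str.format() treats the {...} span as a replacement field, so the quoted column
# list is looked up as a keyword argument and raises KeyError.
template = "INSERT INTO Marketing({'URL', 'Domain_name'}) VALUES (%s,%s)"
try:
    template.format("columns", "placeholders")
except KeyError as err:
    print("KeyError:", err)  # KeyError: "'URL', 'Domain_name'"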
I also tried using to_sql:
from sqlalchemy import create_engine
import pandas as pd

file_path = "path to csv"
engine = create_engine("mssql://user:password@host/database")
df = pd.read_csv(file_path, encoding='latin')
df.to_sql(name='Marketing', con=engine, if_exists='append')
Output:
InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
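A hedged aside: that pyodbc error usually means SQLAlchemy fell back to a bare ODBC DSN lookup because the URL names no driver. Sketches of two explicit alternatives (the URLs are placeholders, not the real credentials):
# Option 1 (assumption): use the pymssql dialect, which needs no ODBC driver at all.
from sqlalchemy import create_engine
engine = create_engine("mssql+pymssql://user:password@host/database")

# Option 2 (assumption): use pyodbc but name an installed ODBC driver explicitly.
engine = create_engine(
    "mssql+pyodbc://user:password@host/database"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)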
I tried everything, from converting the parameters being passed into a tuple to passing them as-is, but nothing helped. Below is the code that fixed the issue:
with open('path to csv file', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)   # skip the header row
    cursor = conn.cursor()
    # plain column names in the statement and pymssql-style %s placeholders
    query = ("INSERT INTO Marketing(URL, Domain_name, Downloadables, Text_without_javascript, Downloadable_Link) "
             "VALUES (%s,%s,%s,%s,%s)")
    for data in reader:
        parameters = tuple(data)
        cursor.execute(query, parameters)
    conn.commit()
Note: The connecting to the database part remains as in the question.
I was tasked with developing a tool that accepts a few parameters and then queries 2 databases based on a list of tables.
There are 3 possible database options: a connection to Netezza, a connection to Oracle, or a connection to a DB2 mainframe. In theory they will pass me the type of connection, hostname, port, database name, username, and password.
The tool will take a table from the list, query it in both databases, and compare the data in the table across the 2 DBs.
For the connection to Netezza I am using pyodbc, for the connection to Oracle I am using cx_Oracle, and for the connection to DB2 I am using ibm_db.
At the moment I am able to make a connection to each, and I am able to return the column metadata of the table in each DB as well as a result set from each.
There are a few things I am trying to accomplish.
If the column is of a numeric data type (i.e. decimal, integer) I want to sum all the values for that column in the table; if it is of any other data type (i.e. string, date) I want to do a count().
I would like to do this for the table in both DBs, then compare the column counts/totals and display the comparison in Excel.
Finally, I would like to do a column-by-column comparison of every row in the table in both DBs. If there are any differences in the field values for a row, then the entire row will be displayed in an Excel spreadsheet.
What I am wondering is whether there are any packages in Python that I can use to perform these table-like operations.
Please see the code below for what I have so far.
import pyodbc
import ibm_db
import cx_Oracle
import collections
class DatabaseConnection(object):
    def __init__(self, connection_type, hostname_or_ip, port, database_or_sid, username, password):
        self.port = port
        self.connection_type = connection_type
        self.hostname_or_ip = hostname_or_ip
        self.database_or_sid = database_or_sid
        self.username = username
        self.password = password
        self.dsn = "GEMPROD"
        self.connection_string = ""
        self.conn = ""

    def __enter__(self):
        if self.connection_type == "Netezza":
            self.connection_string = "DRIVER={NetezzaSQL};SERVER=" + self.hostname_or_ip + ";PORT=" + self.port + \
                ";DATABASE=" + self.database_or_sid + ";UID=" + self.username + ";PWD=" + self.password
            self.conn = pyodbc.connect(self.connection_string)
            return self.conn
        elif self.connection_type == "Oracle":
            dsn_tns = cx_Oracle.makedsn(self.hostname_or_ip, self.port, self.database_or_sid)
            self.conn = cx_Oracle.connect(user=self.username, password=self.password, dsn=dsn_tns)
            return self.conn
        elif self.connection_type == "DB2":
            self.connection_string = "Database=" + self.database_or_sid + ";HOSTNAME=" + self.hostname_or_ip + \
                ";PORT=" + self.port + ";PROTOCOL=TCPIP;UID=" + self.username + ";PWD=" + \
                self.password + ";"
            #self.conn = ibm_db.connect(self.connection_string, "", "")
            self.conn = ibm_db.connect('DSN=' + self.dsn, self.username, self.password)
            return self.conn

    def __exit__(self, type, value, traceback):
        if self.connection_type == "Netezza":
            self.conn.close()
        elif self.connection_type == "DB2":
            ibm_db.close(self.conn)
        elif self.connection_type == "Oracle":
            self.conn.close()

    def __repr__(self):
        return '%s%s' % (self.__class__.__name__, self.dsn)

    def query(self, query, params):
        pass
#database_column_metadata = collections.namedtuple('DatabaseColumnMetadata','index column_name data_type')
#database_field = collections.namedtuple('', '')
table_list = ['BNR_CIF_25DAY_RPT', table2]
sort_column = None
with DatabaseConnection('Netezza', ip, port, database, username, pwd) as connection_one:
    print('Netezza Query:')
    for table in table_list:
        cursor = connection_one.cursor()
        netezza_rows = cursor.execute("SELECT * FROM BNR_CIF_25DAY_RPT LIMIT 1")
        column_list = netezza_rows.description
        sort_column = str(column_list[0][0])
        netezza_query = "SELECT * FROM BNR_CIF_25DAY_RPT ORDER BY " + sort_column + " ASC LIMIT 10"
        netezza_rows = cursor.execute(netezza_query)
        print(column_list)
        netezza_column_list = []
        for idx, column in enumerate(column_list):
            column_name, data_type, *rest = column
            netezza_column_list.append((idx, column_name, data_type))
        for row in netezza_rows:
            print(row, end='\n')
        for tup in netezza_column_list:
            print(tup, end='\n')
        print('Netezza row count:', str(netezza_rows.rowcount) + '\n')
        cursor.close()
with DatabaseConnection('Oracle', hostname, port, SID, username, pwd) as connection_two:
    print('Oracle Query:')
    for table in table_list:
        try:
            cursor = connection_two.cursor()
            oracle_rows = cursor.execute("SELECT * FROM BNR_CIF_25DAY_RPT WHERE ROWNUM <= 1")
            column_list = oracle_rows.description
            sort_column = column_list[0][0]
            oracle_query = "SELECT * FROM (SELECT * FROM BNR_CIF_25DAY_RPT ORDER BY " + sort_column + " ASC) WHERE ROWNUM <= 10"
            oracle_rows = cursor.execute(oracle_query)
            print(column_list)
            oracle_column_list = []
            for idx, column in enumerate(column_list):
                column_name, data_type, *rest = column
                oracle_column_list.append((idx, column_name, data_type))
            for row in oracle_rows:
                print(row, end='\n')
            for tup in oracle_column_list:
                print(tup, end='\n')
            print('Oracle row count:', str(oracle_rows.rowcount) + '\n')
        except cx_Oracle.DatabaseError as e:
            print(str(e))
        finally:
            cursor.close()
Apologies for anything that didn't make sense and for the poor code; I am new to Python and the program is still in its infancy.
This is not exactly a Python-based solution, but we used to do this in our shop to compare Netezza and Oracle using Fluid Query.
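Separately, here is a hedged Python-side sketch of the per-column sum/count comparison described in the question (pandas is an assumption the thread itself does not introduce, and both result sets are assumed to fit in memory):
# Sketch only: df_netezza and df_oracle are assumed to already hold the same table
# fetched from each database. Numeric columns are summed, all others are counted,
# and the two summaries are lined up side by side for export to Excel.
import pandas as pd
from pandas.api.types import is_numeric_dtype

def column_summary(df):
    return pd.Series({
        col: df[col].sum() if is_numeric_dtype(df[col]) else df[col].count()
        for col in df.columns
    })

def compare_tables(df_a, df_b):
    summary = pd.DataFrame({"db_a": column_summary(df_a), "db_b": column_summary(df_b)})
    summary["match"] = summary["db_a"] == summary["db_b"]
    return summary

# compare_tables(df_netezza, df_oracle).to_excel("column_comparison.xlsx")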
Hey everyone, I am developing a Shiny application where we extract data from SQL Server through an ODBC connector by selecting a from and to date in the application. I am unable to identify where the issue is: if I execute the code independently in RStudio I can extract the data from SQL Server, but when the same code is executed in the Shiny environment I am unable to get the data in Shiny. The code is below; kindly guide me on this. Thank you.
# ---------------------ui Code -----------------------------
library(shiny)
shinyUI(pageWithSidebar(
    headerPanel("Time Analytics"),
    sidebarPanel(
        dateRangeInput(inputId = "dateRange",
                       label = "Date range",
                       start = "2007-09-17",
                       max = Sys.Date())
    ),  # sidebar panel ends
    # 09 - Main panel ----
    mainPanel(
        tabsetPanel(id = "theTabs",
                    tabPanel("Summary", dataTableOutput("tabi"), textOutput("tabii")))
    )  # main panel ends
))
#------------------Server ----------------------------------
library(shiny);library(sqldf)
library(plyr);library(RODBC)
library(ggplot2)
#Creating the connection
shinyServer(function(input, output, session){  # pass in a session argument
    # prep the data once and then pass it around the program
    passData <- reactive({
        ch = odbcConnect("Test")
        #qry <- "SELECT * FROM Nifty50"
        #qry <- cat("SELECT * FROM Nifty50 WHERE Date >= ", as.date(input$dateRange[1]), " AND Date <= ", input$dateRange[2])
        qry <- paste("SELECT * FROM Nifty50 WHERE Date >= ", input$dateRange[1], " AND Date <= ", input$dateRange[2])
        subset_Table <- sqlQuery(ch, qry)
        odbcClose(ch)
        subset_Table <- as.data.frame(subset_Table)
        return(subset_Table)
    })
    output$tabi <- renderDataTable({
        d <- as.data.frame(passData())
        d
    })
    output$tabii <- renderText({
        paste("Minimum Date:", input$dateRange[1], "Max Date:", input$dateRange[2])
    })
    # ---------------------------------------------------- End
})
Here the task is that I need to fetch the data from the selected table based on the date from and to criteria, i.e. the subset of data for the dates selected in the Shiny app.
Modify qry as follows, so that the date values are enclosed in single quotes in the generated SQL:
qry <- paste("SELECT * FROM Nifty50 WHERE Date >= '", input$dateRange[1], "' AND Date <= '", input$dateRange[2], "'", sep = "")