Scala, Slick connect to MSSQL server - sql-server

I am trying to connect to a MSSQL database using the slick framework. The following code shows my first attempt but I can't figure out what is wrong.
This error occurs when leaving it as shown below:
[1] value create is not a member of scala.slick.lifted.DDL
Now I delete the line because I do not necessarily need to create the table within my scala code. But then another error arises:
[2] value map is not a member of object asd.asd.App.Coffees
package asd.asd
import scala.slick.driver.SQLServerDriver._
import scala.slick.session.Database.threadLocalSession
object App {
object Coffees extends Table[(String, Int, Double)]("COFFEES") {
def name = column[String]("COF_NAME", O.PrimaryKey)
def supID = column[Int]("SUP_ID")
def price = column[Double]("PRICE")
def * = name ~ supID ~ price
}
def main(args : Array[String]) {
println( "Hello World!" )
val db = slick.session.Database.forURL(url = "jdbc:jtds:sqlserver", user = "test", password = "test", driver = "scala.slick.driver.SQLServerDriver")
db withSession {
Coffees.ddl.create [1]
// Coffees.insertAll(
// ("Colombian", 101, 7.99),
// ("Colombian_Decaf", 101, 8.99),
// ("French_Roast_Decaf", 49, 9.99)
// )
val q = for {
c <- Coffees [2]
} yield (c.name, c.price, c.supID)
println(q.selectStatement)
q.foreach { case (n, p, s) => println(n + ": " + p) }
}
}
}

Problem solved. What I did was the following: Update to the latest Slick version and then adjust the code like demonstrated here. Afterwards you need to exchange the line
import scala.slick.driver.H2Driver.simple._
with
import scala.slick.driver.SQLServerDriver.simple._
And modify the connect string to
[...]
Database.forURL("jdbc:jtds:sqlserver://localhost:1433/<DB>;instance=<INSTANCE>", driver = "scala.slick.driver.SQLServerDriver") withSession {
[...]
After this worked out, I decided to use a c3p0 pooled connection (Which makes Slick very much faster and thus first usable, I do highly recommend to use connection pooling!). This left me with the following database object.
package utils
import scala.slick.driver.SQLServerDriver.simple._
import com.mchange.v2.c3p0.ComboPooledDataSource
object DatabaseUtils {
private val ds = new ComboPooledDataSource
ds.setDriverClass("scala.slick.driver.SQLServerDriver")
ds.setUser("supervisor")
ds.setPassword("password1")
ds.setMaxPoolSize(20)
ds.setMinPoolSize(3)
ds.setTestConnectionOnCheckin(true)
ds.setIdleConnectionTestPeriod(300)
ds.setMaxIdleTimeExcessConnections(240)
ds.setAcquireIncrement(1)
ds.setJdbcUrl("jdbc:jtds:sqlserver://localhost:1433/db_test;instance=SQLEXPRESS")
ds.setPreferredTestQuery("SELECT 1")
private val _database = Database.forDataSource(ds)
def database = _database
}
You can use this like displayed below.
DatabaseUtils.database withSession {
implicit session =>
[...]
}
Last the Maven dependency for c3p0 and the latest Slick version.
<dependency>
<groupId>com.typesafe.slick</groupId>
<artifactId>slick_2.10</artifactId>
<version>2.0.0-M3</version>
</dependency>
<dependency>
<groupId>com.mchange</groupId>
<artifactId>c3p0</artifactId>
<version>0.9.2.1</version>
</dependency>

Related

GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
Tried the following UDF function.
UDF code:
def updateArray = udf((arr: Seq[MyItem]) => {
for (i <- 0 to arr.size - 1) {
// Line 3
val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
val a = arr(i)
println(a.getAs[MyItem](0))
// TODO: How to make code = "XXXX" here
// a.code = "XXXX"
}
arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also I get the following error if remove the line 3 in the UDF (cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How to capture Array of Structs in a function and how to return a modified array of items?
Welcome to Stackoverflow!
There is a small json linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So, for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map{
case (MyClass(MyList(list))) => MyClass(MyList(list.map{
case (MyElement(item)) => MyElement(MyItem(code = "XXXX", item.category))
}))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you see, we're doing some simple pattern matching on the case classes we've defined and successfully renaming all of the code fields' values to "XXXX". If you want to get a json back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally a very small remark about the data. If you have any control over how the data gets made, I would add the following suggestions:
The conditions JSON object seems to have no function in here, since it just contains a single array called list. Consider making the conditions object the array, which would allow you to discard the list name. That would simpify your structure
The element object does nothing, except containing a single item. Consider removing 1 level of abstraction there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need of the MyElement and the MyList case classes! But very often we're not in control over what data we receive so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map{
case MyClass(conditions) => MyClass(conditions.map{
item => MyItem("XXXX", item.category)
})
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
I am able to find a simple solution with Spark 3.1+ as new features are added in this new Spark version.
Updated code:
val data = Seq(
MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
val ds = data.toDF()
val updatedDS = ds.withColumn(
"conditions",
transform(
col("conditions"),
x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
if(oldVal.contains("1234"))
"XXX"
else
oldVal
})

I am getting an error in Kapt Debug Kotlin. I have update versions of dependencies in gradle file. still facing this issue

My app was running smoothly but I am getting this error now.I am getting an error in Kapt Debug Kotlin. I have update versions of dependencies in gradle file. still facing this issue. How it can be resolved? I saw somewhere to see your room database , dao and data class. still not able to figure out what is the issue.
The error is showing this file
ROOM DATABASE
#Database(entities = [Transaction::class], version = 1, exportSchema = false)
abstract class MoneyDatabase : RoomDatabase(){
abstract fun transactionListDao():transactionDetailDao
companion object {
// Singleton prevents multiple instances of database opening at the
// same time.
#Volatile
private var INSTANCE: MoneyDatabase? = null
fun getDatabase(context: Context): MoneyDatabase {
// if the INSTANCE is not null, then return it,
// if it is, then create the database
return INSTANCE ?: synchronized(this) {
val instance = Room.databaseBuilder(
context.applicationContext,
MoneyDatabase::class.java,
"transaction_database"
).build()
INSTANCE = instance
// return instance
instance
}
}
}
}
DAO
#Dao
interface transactionDetailDao {
#Insert(onConflict = OnConflictStrategy.IGNORE)
suspend fun insert(transaction : Transaction)
#Delete
suspend fun delete(transaction : Transaction)
#Update
suspend fun update(transaction: Transaction)
#Query("SELECT * FROM transaction_table ORDER BY id ASC")
fun getalltransaction(): LiveData<List<Transaction>>
}
DATA CLASS
enum class Transaction_type(){
Cash , debit , Credit
}
enum class Type(){
Income, Expense
}
#Entity(tableName = "transaction_table")
data class Transaction(
val name : String,
val amount : Float,
val day : Int,
val month : Int,
val year : Int,
val comment: String,
val datePicker: String,
val transaction_type : String,
val category : String,
val recurring_from : String,
val recurring_to : String
){
#PrimaryKey(autoGenerate = true) var id :Long=0
}
The error is resolved. I was using the kotlin version 1.6.0. I reduced it to 1.4.32. As far as I understood, above(latest) version of Kotlin along with Room and coroutines doesn’t work well.
I believe that your issue is due to the use of an incorrect class being inadvertently used, a likely culprit being Transaction as that it also a Room class
Perhaps in transactionDetailDao (although it might be elsewhere)
See if you have import androidx.room.Transaction? (or any other imports with Transaction)?
If so delete or comment out the import
As an example with, and :-
And with the import commented out :-
Imported from github, had a play issue definitely appears to be with co-routines. commented out suspends in the Dao :-
#Dao
interface transactionDetailDao {
#Insert(onConflict = OnConflictStrategy.IGNORE)
suspend fun insert(transaction : Transaction)
#Delete
suspend fun delete(transaction : Transaction)
#Update
suspend fun update(transaction: Transaction)
#Query("SELECT * FROM transaction_table ORDER BY id ASC")
fun getalltransaction(): LiveData<List<Transaction>>
}
Compiled ok and ran and had a play e.g. :-

super confused with table and dataset or datastream conversion

I am using Flink 1.12, and I am super confused with when table and dataset/datastream conversion can be performed.
In the following code, I want to print the table content to the console, and I tried the following 3 ways
,all of them throws exception
table.toDataSet[Row].print()
table.toAppendStream[Row].print()
table.print()
I would ask how to print the table content to the console,eg, using the print method
import org.apache.flink.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, TableEnvironment, TableResult}
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}
import org.apache.flink.types.Row
object Sql021_PlannerOldBatchTest {
def main(args: Array[String]): Unit = {
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val env = TableEnvironment.create(settings)
val fmt = new Csv().fieldDelimiter(',').deriveSchema()
val schema = new Schema()
.field("a", DataTypes.STRING())
.field("b", DataTypes.STRING())
.field("c", DataTypes.DOUBLE())
env.connect(new FileSystem().path("D:/stock.csv")).withSchema(schema).withFormat(fmt).createTemporaryTable("sourceTable")
val table = env.sqlQuery("select * from sourceTable")
//ERROR: Only tables that originate from Scala DataSets can be converted to Scala DataSets.
// table.toDataSet[Row].print()
//ERROR:Only tables that originate from Scala DataStreams can be converted to Scala DataStreams.
table.toAppendStream[Row].print()
//ERROR: table doesn't has the print method
// table.print()
}
}
In the streaming case, this will work
tenv.toAppendStream(table, classOf[Row]).print()
env.execute()
and the batch case you can do
val tableResult: TableResult = env.executeSql("select * from sourceTable")
tableResult.print()

Spark streaming nested execution serialization issues

I am trying to connect DB2 database in the spark streaming application and the database query execution statement causing "org.apache.spark.SparkException: Task not serializable" issues. Please advise. Below is the sample code I have for reference.
dataLines.foreachRDD{rdd=>
val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
val dataRows=rdd.map(rs => rs.value).map(row =>
row.split(",")(1)-> (row.split(",")(0), row.split(",")(1), row.split(",")(2)
, "cvflds_"+row.split(",")(3).toLowerCase, row.split(",")(4), row.split(",")(5), row.split(",")(6))
)
val db2Conn = getDB2Connection(spark,db2ConParams)
dataRows.foreach{ case (k,v) =>
val table = v._4
val dbQuery = s"(SELECT * FROM $table ) tblResult"
val df=getTableData(db2Conn,dbQuery)
df.show(2)
}
}
Below is other function code:
private def getDB2Connection(spark: SparkSession, db2ConParams:scala.collection.immutable.Map[String,String]): DataFrameReader = {
spark.read.format("jdbc").options(db2ConParams)
}
private def getTableData(db2Con: DataFrameReader,tableName: String):DataFrame ={
db2Con.option("dbtable",tableName).load()
}
object SparkSessionSingleton {
#transient private var instance: SparkSession = _
def getInstance(sparkConf: SparkConf): SparkSession = {
if (instance == null) {
instance = SparkSession
.builder
.config(sparkConf)
.getOrCreate()
}
instance
}
}
Below is the error log:
2018-03-28 22:12:21,487 [JobScheduler] ERROR org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1522289540000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:915)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:915)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:139)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:128)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.DataFrameReader
Serialization stack:
- object not serializable (class: org.apache.spark.sql.DataFrameReader, value: org.apache.spark.sql.DataFrameReader#15fdb01)
- field (class: ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, name: db2Conn$1, type: class org.apache.spark.sql.DataFrameReader)
- object (class ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 30 more
Ideally you should keep the closure in dataRows.foreach clear of any connection objects, since the closure is meant to be serialized to executors and run there. This concept is covered in depth # this official link
In your case below line is the closure that is causing the issue:
val df=getTableData(db2Conn,dbQuery)
So, instead of using spark to get the DB2 table loaded, which in your case becomes(after combining the methods):
spark.read.format("jdbc").options(db2ConParams).option("dbtable",tableName).load()
Use plain JDBC in the closure to achieve this. You can use db2ConParams in the jdbc code. (I assume its simple enough to be serializable). The link also suggests using rdd.foreachPartition and ConnectionPool to further optimize.
You have not mentioned what you are doing with the table data except df.show(2). If the rows are huge, then you may discuss more about your use case. Perhaps, you need to consider a different design then.

How do I execute a count command using scala.dbc?

I am trying to connect to an MS/SQL server and execute a 'count' statement. I've reached this far:
import scala.dbc._
import scala.dbc.Syntax._
import scala.dbc.syntax.Statement._
import java.net.URI
object MsSqlVendor extends Vendor {
val uri = new URI("jdbc:sqlserver://173.248.X.X:Y/DataBaseName")
val user = "XXX"
val pass = "XXX"
val retainedConnections = 5
val nativeDriverClass = Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
val urlProtocolString = "jdbc:sqlserver:"
}
object Main {
def main(args: Array[String]) {
println("Hello, world!")
val db = new Database(MsSqlVendor)
val count = db.executeStatement {
select (count) from (technical)
}
println("%d rows counted", count)
}
}
I get an error saying: "scala.dbc.syntax.Statement.select of type dbc.syntax.Statement.SelectZygote does not take parameters"
How do I set this up?
This can be a problem:
val count = db.executeStatement {
select (count) from (technical)
}
The count inside the statement refers to val count, not to some other count. Still, there's the other problem you report. There's neither a count nor a technical definition anywhere. Maybe it is missing from some other place where you found this snippet. The following does compile, though it's anyone's guess as to whether it does what you want:
val countx = db.executeStatement {
select fields "count" from "technical"
}
At any rate, I thought scala.dbc was long deprecated. I can't find any deprecation notice, however, and it is still linked on the library jar, even on trunk.

Resources