I am developing a project with Scala, Play Framework 2.3.8 and Anorm, working against an MS SQL database.
Here is the code:
val repeatedCardsQuery = SQL("Execute Forms.getListOfRepeatCalls {user}, {cardId}")

DB.withConnection { implicit c =>
  val result = repeatedCardsQuery.on("user" -> user.name, "cardId" -> id)().map(row =>
    Json.obj(
      "Unified_CardNumber" -> row[Long]("scId"),
      "ContentSituation_TypeSituationName" -> row[String]("typeOfEventsName"),
      "Unified_Date" -> row[Date]("creationDate"),
      "InfoPlaceSituation_Address" -> row[String]("address"),
      "DescriptionSituation_DescriptionSituation" -> row[Option[String]]("description")
    )
  ).toList
  Ok(response(id, "repeats", result))
}
And it gives me a runtime error:
play.api.Application$$anon$1: Execution exception[[RuntimeException: Left(TypeDoesNotMatch(Cannot convert 2015-02-14 15:38:15.4089363 +03:00: class microsoft.sql.DateTimeOffset to Date for column ColumnName(.creationDate,Some(creationDate))))]]
at play.api.Application$class.handleError(Application.scala:296) ~[play_2.10-2.3.8.jar:2.3.8]
at play.api.DefaultApplication.handleError(Application.scala:402) [play_2.10-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:205) [play_2.10-2.3.8.jar:2.3.8]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$14$$anonfun$apply$1.applyOrElse(PlayDefaultUpstreamHandler.scala:202) [play_2.10-2.3.8.jar:2.3.8]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) [scala-library.jar:na]
Caused by: java.lang.RuntimeException: Left(TypeDoesNotMatch(Cannot convert 2015-02-14 15:38:15.4089363 +03:00: class microsoft.sql.DateTimeOffset to Date for column ColumnName(.creationDate,Some(creationDate))))
at anorm.MayErr$$anonfun$get$1.apply(MayErr.scala:34) ~[anorm_2.10-2.3.8.jar:2.3.8]
at anorm.MayErr$$anonfun$get$1.apply(MayErr.scala:33) ~[anorm_2.10-2.3.8.jar:2.3.8]
at scala.util.Either.fold(Either.scala:97) ~[scala-library.jar:na]
at anorm.MayErr.get(MayErr.scala:33) ~[anorm_2.10-2.3.8.jar:2.3.8]
at anorm.Row$class.apply(Row.scala:57) ~[anorm_2.10-2.3.8.jar:2.3.8]
[error] application - Left(TypeDoesNotMatch(Cannot convert 2015-02-14 15:38:15.4089363 +03:00: class microsoft.sql.DateTimeOffset to Date for column ColumnName(.creationDate,Some(creationDate))))
Actually, the error occurs on this line:
"Unified_Date" -> row[Date]("creationDate"),
How can I solve that problem and convert microsoft.sql.DateTimeOffset to Date in this case?
A custom column conversion can be added next to your Anorm call (defined or imported in the same class/object that needs to support this type).
import java.util.Date
import microsoft.sql.DateTimeOffset
import anorm.{ Column, MetaDataItem, TypeDoesNotMatch }

implicit def columnToDate: Column[Date] = Column.nonNull { (value, meta) =>
  val MetaDataItem(qualified, nullable, clazz) = meta
  value match {
    case ms: DateTimeOffset =>
      Right(ms.getTimestamp)
    case _ =>
      Left(TypeDoesNotMatch(s"Cannot convert $value: ${value.asInstanceOf[AnyRef].getClass} to Date for column $qualified"))
  }
}
Btw, it would be much better to work with DB values represented as JDBC dates (java.sql.Date/Timestamp types supported by Anorm), for example by casting on the SQL side.
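A rough sketch of that idea, where the direct query (table name and WHERE clause) is hypothetical and only the column and parameter names come from the question:

// Hypothetical direct query: Forms.RepeatCalls and the WHERE clause are made up for illustration.
val repeatedCardsDirectQuery = SQL(
  """SELECT scId, typeOfEventsName,
            CAST(creationDate AS datetime2) AS creationDate,
            address, description
       FROM Forms.RepeatCalls
      WHERE userName = {user} AND cardId = {cardId}""")

// The JDBC driver reports a datetime2 column as java.sql.Timestamp (a java.util.Date subclass),
// so row[Date]("creationDate") works with Anorm's built-in conversions, no custom Column needed.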
I get this error when using Npgsql binary COPY (see my implementation at the bottom):
invalid sign in external "numeric" value
The RetailCost column on which it fails has the following PostgreSQL properties:
type: numeric(19,2)
nullable: not null
storage: main
default: 0.00
The PostgreSQL log looks like this:
2019-07-15 13:24:05.131 ACST [17856] ERROR: invalid sign in external "numeric" value
2019-07-15 13:24:05.131 ACST [17856] CONTEXT: COPY products_temp, line 1, column RetailCost
2019-07-15 13:24:05.131 ACST [17856] STATEMENT: copy products_temp (...) from stdin (format binary)
I don't think it should matter, but there are only zero or positive RetailCost values (no negative or null ones).
My implementation looks like this:
using (var importer = conn.BeginBinaryImport($"copy {tempTableName} ({dataColumns}) from stdin (format binary)"))
{
    foreach (var product in products)
    {
        importer.StartRow();
        importer.Write(product.ManufacturerNumber, NpgsqlDbType.Text);
        if (product.LastCostDateTime == null)
            importer.WriteNull();
        else
            importer.Write((DateTime)product.LastCostDateTime, NpgsqlDbType.Timestamp);
        importer.Write(product.LastCost, NpgsqlDbType.Numeric);
        importer.Write(product.AverageCost, NpgsqlDbType.Numeric);
        importer.Write(product.RetailCost, NpgsqlDbType.Numeric);
        if (product.TaxPercent == null)
            importer.WriteNull();
        else
            importer.Write((decimal)product.TaxPercent, NpgsqlDbType.Numeric);
        importer.Write(product.Active, NpgsqlDbType.Boolean);
        importer.Write(product.NumberInStock, NpgsqlDbType.Bigint);
    }
    importer.Complete();
}
Any suggestions would be welcome
I caused this problem myself by using MapBigInt, not realizing the column was accidentally numeric(18). Changing the column to bigint fixed my problem.
The cause of the problem was that the importer.Write statements were not in the same order as the columns listed in dataColumns.
I would like to use Slick (3.2.3) to connect to an MSSQL database.
Currently, my project looks like the following.
In application.conf, I have
somedbname = {
  driver = "slick.jdbc.SQLServerProfile$"
  db {
    host = "somehost"
    port = "someport"
    databaseName = "Recupel.Datawarehouse"
    url = "jdbc:sqlserver://"${somedbname.db.host}":"${somedbname.db.port}";databaseName="${somedbname.db.databaseName}";"
    user = "someuser"
    password = "somepassword"
  }
}
The "somehost" looks like XX.X.XX.XX where X's are numbers.
My build.sbt contains
name := "test-slick"
version := "0.1"
scalaVersion in ThisBuild := "2.12.7"
libraryDependencies ++= Seq(
  "com.typesafe.slick" %% "slick" % "3.2.3",
  "com.typesafe.slick" %% "slick-hikaricp" % "3.2.3",
  "org.slf4j" % "slf4j-nop" % "1.6.4",
  "com.microsoft.sqlserver" % "mssql-jdbc" % "7.0.0.jre10"
)
The file with the "main" object contains
import slick.basic.DatabaseConfig
import slick.jdbc.JdbcProfile
import slick.jdbc.SQLServerProfile.api._
import scala.concurrent.Await
import scala.concurrent.duration._
val dbConfig: DatabaseConfig[JdbcProfile] = DatabaseConfig.forConfig("somedbname")
val db: JdbcProfile#Backend#Database = dbConfig.db
def main(args: Array[String]): Unit = {
  try {
    val future = db.run(sql"SELECT * FROM somettable".as[(Int, String, String, String, String,
      String, String, String, String, String, String, String)])
    println(Await.result(future, 10.seconds))
  } finally {
    db.close()
  }
}
}
This, according to all the documentation that I could find, should be enough to connect to the database. However, when I run this, I get
[error] (run-main-0) java.sql.SQLTransientConnectionException: somedbname.db - Connection is not available, request timed out after 1004ms.
[error] java.sql.SQLTransientConnectionException: somedbname.db - Connection is not available, request timed out after 1004ms.
[error] at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:548)
[error] at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:186)
[error] at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:145)
[error] at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:83)
[error] at slick.jdbc.hikaricp.HikariCPJdbcDataSource.createConnection(HikariCPJdbcDataSource.scala:14)
[error] at slick.jdbc.JdbcBackend$BaseSession.<init>(JdbcBackend.scala:453)
[error] at slick.jdbc.JdbcBackend$DatabaseDef.createSession(JdbcBackend.scala:46)
[error] at slick.jdbc.JdbcBackend$DatabaseDef.createSession(JdbcBackend.scala:37)
[error] at slick.basic.BasicBackend$DatabaseDef.acquireSession(BasicBackend.scala:249)
[error] at slick.basic.BasicBackend$DatabaseDef.acquireSession$(BasicBackend.scala:248)
[error] at slick.jdbc.JdbcBackend$DatabaseDef.acquireSession(JdbcBackend.scala:37)
[error] at slick.basic.BasicBackend$DatabaseDef$$anon$2.run(BasicBackend.scala:274)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
[error] at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[error] at java.base/java.lang.Thread.run(Thread.java:844)
[error] Nonzero exit code: 1
Perhaps related, and also annoying: when I run this code a second (or subsequent) time, I get the following error instead:
Failed to get driver instance for jdbcUrl=jdbc:sqlserver://[...]
which forces me to kill and reload sbt each time.
What am I doing wrong? Worth noting: I can connect to the database with the same credentials from software like Valentina.
As suggested by @MarkRotteveel, and following this link, I found a solution.
First, I explicitly set the driver, adding the line
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
in the db dictionary, after password = "somepassword".
Secondly, the default timeout (after one second) appears to be too short for my purposes, and therefore I added the line
connectionTimeout = "30 seconds"
after the previous driver line, still in the db dictionary.
Now it works.
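For reference, the somedbname block in application.conf then looks like this (only the two new lines were added):

somedbname = {
  driver = "slick.jdbc.SQLServerProfile$"
  db {
    host = "somehost"
    port = "someport"
    databaseName = "Recupel.Datawarehouse"
    url = "jdbc:sqlserver://"${somedbname.db.host}":"${somedbname.db.port}";databaseName="${somedbname.db.databaseName}";"
    user = "someuser"
    password = "somepassword"
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    connectionTimeout = "30 seconds"
  }
}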
I am getting an error when executing the code below. I have a Datastore entity with a property of type Date. An example value stored for that property in a particular row is 2016-01-03 (19:00:00.000) EDT.
The code I am executing filters the entities on a date greater than 2016-01-01. Any idea what is wrong with the code?
Error
ValueError: Unknown protobuf attr type <type 'datetime.date'>
Code
import pandas as pd
import numpy as np
from datetime import datetime
from google.cloud import datastore
from flask import Flask, Blueprint

app = Flask(__name__)
computation_cron = Blueprint('cron.stock_data_transformation', __name__)

@computation_cron.route('/cron/stock_data_transformation')
def cron():
    ds = datastore.Client(project="earningspredictor-173913")
    query = ds.query(kind='StockPrice')
    query.add_filter('date', '>', datetime.strptime("2016-01-01", '%Y-%m-%d').date())
    dataframe_data = []
    temp_dict = {}
    for q in query.fetch():
        temp_dict["stock_code"] = q["stock_code"]
        temp_dict["date"] = q["date"]
        temp_dict["ex_dividend"] = q["ex_dividend"]
        temp_dict["split_ratio"] = q["split_ratio"]
        temp_dict["adj_open"] = q["adj_open"]
        temp_dict["adj_high"] = q["adj_high"]
        temp_dict["adj_low"] = q["adj_low"]
        temp_dict["adj_close"] = q["adj_close"]
        temp_dict["adj_volume"] = q["adj_volume"]
        dataframe_data.append(temp_dict)
    sph = pd.DataFrame(data=dataframe_data, columns=temp_dict.keys())
    # print sph.to_string()

    query = ds.query(kind='EarningsSurprise')
    query.add_filter('act_rpt_date', '>', datetime.strptime("2016-01-01", '%Y-%m-%d').date())
    dataframe_data = []
    temp_dict = {}
    for q in query.fetch():
        temp_dict["stock_code"] = q["stock_code"]
        temp_dict["eps_amount_diff"] = q["eps_amount_diff"]
        temp_dict["eps_actual"] = q["eps_actual"]
        temp_dict["act_rpt_date"] = q["act_rpt_date"]
        temp_dict["act_rpt_code"] = q["act_rpt_code"]
        temp_dict["eps_percent_diff"] = q["eps_percent_diff"]
        dataframe_data.append(temp_dict)
    es = pd.DataFrame(data=dataframe_data, columns=temp_dict.keys())
You seem to be using the generic google-cloud-datastore client library, not the NDB Client Library.
For google-cloud-datastore all date and/or time properties have the same format. From Date and time:
JSON
field name: timestampValue
type: string (RFC 3339 formatted, with milliseconds, for instance 2013-05-14T00:01:00.234Z)
Protocol buffer
field name: timestamp_value
type: Timestamp
Sort order: Chronological
Notes: When stored in Cloud Datastore, precise only to microseconds; any additional precision is rounded down.
So when setting/comparing such properties, try to use strings formatted as specified (or integers for the protobuf Timestamp?), not objects from the datetime module directly (those work with the NDB library). The same might be true for queries as well.
Note: this is based on the documentation only; I haven't used the generic library myself.
I have a DataFrame df that contains one column of type array.
df.show() looks like
+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+
I am trying to dump that df to a CSV file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It is not working because of the column ArrayOfString. I get the error:
CSV data source does not support array<string> data type
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the DataFrame to CSV, including the column ArrayOfString (ArrayOfString should be dumped as one column in the CSV file)?
The reason you are getting this error is that the CSV file format doesn't support array types; you'll need to express the column as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _ => s"""[${vs.mkString(",")}]"""
})
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
PySpark implementation.
In this example, the array field column_as_array is converted to a string column column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:
import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
  .write
  .csv(path="/home/me/saveDF")
Hope that helps.
Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}
Also, it doesn't use a UDF.
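Usage is then just, for example (the output path is taken from the question):

stringifyArrays(df).write.csv("/home/me/saveDF")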
CSV is not the ideal export format, but if you just want to visually inspect your data, this will work [Scala]. Quick and dirty solution.
import spark.implicits._  // assuming a SparkSession named spark is in scope; needed for .toDF

case class example(id: String, ArrayOfString: String, Age: String, Gender: String)

df.rdd.map { line => example(line(0).toString, line(1).toString, line(2).toString, line(3).toString) }.toDF.write.csv("/tmp/example.csv")
To answer DreamerP's question (from one of the comments):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
I have two streams:
Measurements
WhoMeasured (metadata about who took the measurement)
These are the case classes for them:
case class Measurement(var value: Int, var who_measured_id: Int)
case class WhoMeasured(var who_measured_id: Int, var name: String)
The Measurement stream has a lot of data. The WhoMeasured stream has little. In fact, for each who_measured_id in the WhoMeasured stream, only one name is relevant, so old elements can be discarded when a new one with the same who_measured_id arrives. This is essentially a hash table that gets filled by the WhoMeasured stream.
In my custom window function
class WFunc extends WindowFunction[Measurement, Long, Int, TimeWindow] {
  override def apply(key: Int, window: TimeWindow, input: Iterable[Measurement], out: Collector[Long]): Unit = {
    // Here I need access to the WhoMeasured stream to get the name of the person who took a measurement.
    // The following two are equivalent since I keyed by who_measured_id:
    val name_who_measured = magic(key)
    val name_who_measured = magic(input.head.who_measured_id)
  }
}
This is my job. As you might see, there is something missing: the combination of the two streams.
val who_measured_stream = who_measured_source
  .keyBy(w => w.who_measured_id)
  .countWindow(1)

val measurement_stream = measurements_source
  .keyBy(m => m.who_measured_id)
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply(new WFunc)
So in essence this is sort of a lookup table that gets updated when new elements in the WhoMeasured stream arrive.
So the question is: How to achieve such a lookup from one WindowedStream into another?
Follow-up:
After implementing it the way Fabian suggested, the job always fails with a serialization issue:
[info] Loading project definition from /home/jgroeger/Code/MeasurementJob/project
[info] Set current project to MeasurementJob (in build file:/home/jgroeger/Code/MeasurementJob/)
[info] Compiling 8 Scala sources to /home/jgroeger/Code/MeasurementJob/target/scala-2.11/classes...
[info] Running de.company.project.Main dev MeasurementJob
[error] Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.
[error] at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:100)
[error] at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1478)
[error] at org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:161)
[error] at org.apache.flink.streaming.api.datastream.ConnectedStreams.flatMap(ConnectedStreams.java:230)
[error] at org.apache.flink.streaming.api.scala.ConnectedStreams.flatMap(ConnectedStreams.scala:127)
[error] at de.company.project.jobs.MeasurementJob.run(MeasurementJob.scala:139)
[error] at de.company.project.Main$.main(Main.scala:55)
[error] at de.company.project.Main.main(Main.scala)
[error] Caused by: java.io.NotSerializableException: de.company.project.jobs.MeasurementJob
[error] at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
[error] at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
[error] at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
[error] at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
[error] at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
[error] at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
[error] at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:301)
[error] at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:81)
[error] ... 7 more
java.lang.RuntimeException: Nonzero exit code returned from runner: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last MeasurementJob/compile:run for the full output.
[error] (MeasurementJob/compile:run) Nonzero exit code returned from runner: 1
[error] Total time: 9 s, completed Nov 15, 2016 2:28:46 PM
Process finished with exit code 1
The error message:
The implementation of the RichCoFlatMapFunction is not serializable. The object probably contains or references non serializable fields.
However, the only field my JoiningCoFlatMap has is the suggested ValueState.
The signature looks like this:
class JoiningCoFlatMap extends RichCoFlatMapFunction[Measurement, WhoMeasured, (Measurement, String)] {
I think what you want to do is a window operation followed by a join.
You can implement the join of a high-volume stream and a low-volume, update-by-key stream using a stateful CoFlatMapFunction, as in the example below:
val measures: DataStream[Measurement] = ???
val who: DataStream[WhoMeasured] = ???

val agg: DataStream[(Int, Long)] = measures
  .keyBy(_.who_measured_id)
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .apply( (id: Int, w: TimeWindow, v: Iterable[Measurement], out: Collector[(Int, Long)]) => {
    // do your aggregation
  })

val joined: DataStream[(Int, Long, String)] = agg
  .keyBy(_._1) // who_measured_id
  .connect(who.keyBy(_.who_measured_id))
  .flatMap(new JoiningCoFlatMap)

// CoFlatMapFunction
class JoiningCoFlatMap extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  var names: ValueState[String] = null

  override def open(conf: Configuration): Unit = {
    val stateDescrptr = new ValueStateDescriptor[String](
      "whoMeasuredName",
      classOf[String],
      "" // default value
    )
    names = getRuntimeContext.getState(stateDescrptr)
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    // join with state
    out.collect( (a._1, a._2, names.value()) )
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    // update state
    names.update(w.name)
  }
}
A note on the implementation: a CoFlatMapFunction cannot decide which input to process; flatMap1 and flatMap2 are called depending on what data arrives at the operator, and this cannot be controlled by the function. This is a problem when initializing the state: in the beginning, the state might not yet hold the correct name for an arriving Measurement object and will return the default value instead. You can avoid that by buffering the measurements and joining them once the first update for the key arrives from the who stream. You'll need another state for that, as sketched below.
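A rough sketch of that buffering variant, building on the JoiningCoFlatMap above (untested; the class name, state names and the choice of a ValueState holding a List are just one way to do it):

class BufferingJoiningCoFlatMap
    extends RichCoFlatMapFunction[(Int, Long), WhoMeasured, (Int, Long, String)] {

  var names: ValueState[String] = _
  var pending: ValueState[List[(Int, Long)]] = _  // records seen before the first name arrived

  override def open(conf: Configuration): Unit = {
    names = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("whoMeasuredName", classOf[String], null))
    pending = getRuntimeContext.getState(
      new ValueStateDescriptor[List[(Int, Long)]]("pendingMeasurements", classOf[List[(Int, Long)]], Nil))
  }

  override def flatMap1(a: (Int, Long), out: Collector[(Int, Long, String)]): Unit = {
    val name = names.value()
    if (name == null) pending.update(a :: pending.value())  // no name yet: buffer the record
    else out.collect((a._1, a._2, name))                    // name known: join and emit
  }

  override def flatMap2(w: WhoMeasured, out: Collector[(Int, Long, String)]): Unit = {
    names.update(w.name)
    pending.value().reverse.foreach(a => out.collect((a._1, a._2, w.name)))  // flush buffered records
    pending.update(Nil)
  }
}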