TypeInformation not defined - apache-flink

object EventConsumer {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Define the data source
    val data1 = env.readTextFile("file:////some_events.txt")
    data1.map(new myMapFunction)
  }

  class myMapFunction extends MapFunction[String, Unit] {
    override def map(in: String): Unit = {
      println(in)
    }
  }
}
I've been stuck on this compilation error for a long time; any help would be appreciated.
Error:(27, 15) could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[String]
    flatMap { _.split("\n")}.filter(_.nonEmpty).map (new myMapFunction)
                                                ^
Error:(24, 15) not enough arguments for method map: (implicit evidence$2: org.apache.flink.api.common.typeinfo.TypeInformation[Unit], implicit evidence$3: scala.reflect.ClassTag[Unit])org.apache.flink.api.scala.DataSet[Unit].
Unspecified value parameters evidence$2, evidence$3.
    data1.map (new myMapFunction)
          ^

When using Flink's Scala DataSet API it is necessary to add the following import to your code: import org.apache.flink.api.scala._.
When using Flink's Scala DataStream API you have to add import org.apache.flink.streaming.api.scala._.
The reason is that the package object contains a function which generates the missing TypeInformation instances.
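With that import added, the example from the question compiles. Here is a minimal sketch (keeping the placeholder path from the question; print() is added only so the data set has a sink and the job actually runs):

import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.scala._ // the package object provides implicit TypeInformation generation

object EventConsumer {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data1 = env.readTextFile("file:////some_events.txt")
    // The evidence parameters are now filled in by the implicit createTypeInformation
    data1.map(new myMapFunction).print() // print() acts as a sink and triggers execution
  }

  class myMapFunction extends MapFunction[String, Unit] {
    override def map(in: String): Unit = println(in)
  }
}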

Related

GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
  .select("conditions")
  .show(2)
I tried the following UDF:
def updateArray = udf((arr: Seq[MyItem]) => {
  for (i <- 0 to arr.size - 1) {
    // Line 3: the cast referred to in the error below
    val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
    // val a = arr(i)  // variant without the cast; fails with the ClassCastException shown below
    println(a.getAs[MyItem](0))
    // TODO: How to make code = "XXXX" here
    // a.code = "XXXX"
  }
  arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also, I get the following error if I remove line 3 in the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How do I capture an array of structs in a function, and how do I return a modified array of items?
Welcome to Stack Overflow!
There is a small JSON linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map {
  case MyClass(MyList(list)) => MyClass(MyList(list.map {
    case MyElement(item) => MyElement(MyItem(code = "XXXX", item.category))
  }))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you can see, we're doing some simple pattern matching on the case classes we've defined and successfully replacing all of the code fields' values with "XXXX". If you want to get JSON back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally a very small remark about the data. If you have any control over how the data gets made, I would add the following suggestions:
The conditions JSON object seems to serve no purpose here, since it just contains a single array called list. Consider making conditions the array itself, which would allow you to discard the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of abstraction there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need for the MyElement and MyList case classes! But very often we're not in control of the data we receive, so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map {
  case MyClass(conditions) => MyClass(conditions.map { item =>
    MyItem("XXXX", item.category)
  })
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
I was able to find a simple solution with Spark 3.1+, as new features were added in this Spark version.
Updated code:
val data = Seq(
  MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
  MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
val ds = data.toDF()
val updatedDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
  if (oldVal.contains("1234"))
    "XXX"
  else
    oldVal
})
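As a side note not found in the original answer: for a condition this simple, the UDF could likely be avoided altogether by expressing the masking rule with the built-in when/otherwise functions inside transform, keeping everything in Spark's native column expressions. A rough sketch under that assumption:

import org.apache.spark.sql.functions.{col, lit, transform, when}

// Same masking rule as the UDF above, but with built-in column functions:
// if the code contains "1234", replace it with "XXX", otherwise keep it unchanged.
val updatedNoUdf = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField(
      "code",
      when(x.getField("code").contains("1234"), lit("XXX"))
        .otherwise(x.getField("code")))))
updatedNoUdf.show()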

super confused with table and dataset or datastream conversion

I am using Flink 1.12, and I am super confused about when table and DataSet/DataStream conversions can be performed.
In the following code, I want to print the table content to the console, and I tried the following three ways, all of which throw exceptions:
table.toDataSet[Row].print()
table.toAppendStream[Row].print()
table.print()
How can I print the table content to the console, e.g. using a print method?
import org.apache.flink.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, TableEnvironment, TableResult}
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}
import org.apache.flink.types.Row
object Sql021_PlannerOldBatchTest {
  def main(args: Array[String]): Unit = {
    val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
    val env = TableEnvironment.create(settings)

    val fmt = new Csv().fieldDelimiter(',').deriveSchema()
    val schema = new Schema()
      .field("a", DataTypes.STRING())
      .field("b", DataTypes.STRING())
      .field("c", DataTypes.DOUBLE())

    env.connect(new FileSystem().path("D:/stock.csv")).withSchema(schema).withFormat(fmt).createTemporaryTable("sourceTable")

    val table = env.sqlQuery("select * from sourceTable")

    // ERROR: Only tables that originate from Scala DataSets can be converted to Scala DataSets.
    // table.toDataSet[Row].print()

    // ERROR: Only tables that originate from Scala DataStreams can be converted to Scala DataStreams.
    table.toAppendStream[Row].print()

    // ERROR: Table does not have a print method.
    // table.print()
  }
}
In the streaming case, this will work
tenv.toAppendStream(table, classOf[Row]).print()
env.execute()
and in the batch case you can do
val tableResult: TableResult = env.executeSql("select * from sourceTable")
tableResult.print()

Scala how convert from Array to JSON format

I have the following code:
socket.emit("login",obj, new Ack {
override def call(args: AnyRef*): Unit = {
println(args)
}
})
Console output
WrappedArray({"uid":989,"APILevel":5,"status":"ok"})
How to convert args from WrappedArray to JSON?
There are many JSON libraries for Scala. For example, you can take a look at Scala json parsers performance to see usage and performance. I'll demonstrate how this can be done using play-json. We first need to create a case class that represents your data model:
case class Model(uid: Int, APILevel: Int, status: String)
Now we need to create a formatter, on the companion object:
object Model {
  implicit val format: Format[Model] = Json.format[Model]
}
To create a JsValue from the raw string, you can:
val input: String = "{\"uid\":989,\"APILevel\":5,\"status\":\"ok\"}"
val json: JsValue = Json.parse(input)
and to convert it to the model:
val model: Model = json.as[Model]
A complete running example can be found at Scastie. Just don't forget to add play-json as a dependency by adding the following to your build.sbt (play-json is published to Maven Central, so no extra resolver is needed):
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.9.1"
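To tie this back to the Ack callback in the question, here is a sketch under the assumption that the first element of args stringifies to the JSON shown in the console output:

import play.api.libs.json.{JsValue, Json}

socket.emit("login", obj, new Ack {
  override def call(args: AnyRef*): Unit = {
    // Assumption: args.head.toString yields {"uid":989,"APILevel":5,"status":"ok"}
    val json: JsValue = Json.parse(args.head.toString)
    val model: Model = json.as[Model]
    println(model) // Model(989,5,ok)
  }
})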

Scala play api for JSON - getting Array of some case class from stringified JSON?

From our code, we call some service and get back stringified JSON as a result. The stringified JSON is an array of "SomeItem", which has just four fields: three Longs and one String.
Ex:
[
{"id":33,"count":40000,"someOtherCount":0,"someString":"stuffHere"},
{"id":35,"count":23000,"someOtherCount":0,"someString":"blah"},
...
]
I've been using the play API to read values out using implicit Writes / Reads. But I'm having trouble getting it to work for Arrays.
For example, I've been trying to parse the value out of the response and then convert it to an array of SomeItem, but it's failing:
val sanityCheckValue: JsValue = Json.parse(response.body)
val Array[SomeItem] = Json.fromJson(sanityCheckValue)
I have
implicit val someItemReads = Json.reads[SomeItem]
But it looks like it's not working. I've tried to set up a Json.reads[Array[SomeItem]] as well, but no luck.
Should this be working? Any tips on how to get this to work?
import play.api.libs.json._

case class SomeItem(id: Long, count: Long, someOtherCount: Long, someString: String)

object SomeItem {
  implicit val format = Json.format[SomeItem]
}

object PlayJson {
  def main(args: Array[String]): Unit = {
    val strJson =
      """
        |[
        |  {"id":33,"count":40000,"someOtherCount":0,"someString":"stuffHere"},
        |  {"id":35,"count":23000,"someOtherCount":0,"someString":"blah"}
        |]
      """.stripMargin

    val listOfSomeItems: Array[SomeItem] = Json.parse(strJson).as[Array[SomeItem]]
    listOfSomeItems.foreach(println)
  }
}
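If you prefer the Json.fromJson route that the question was attempting, the result comes back wrapped in a JsResult, which you can pattern match on instead of calling as. A brief sketch, reusing the SomeItem format defined above:

import play.api.libs.json._

val parsed: JsResult[Array[SomeItem]] = Json.fromJson[Array[SomeItem]](Json.parse(strJson))
parsed match {
  case JsSuccess(items, _) => items.foreach(println)
  case JsError(errors)     => println(s"Failed to parse: $errors")
}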

Can't use reactivestream Subscriber with akka stream sources

I'm trying to attach a Reactive Streams Subscriber to an Akka Streams Source.
My source seems to work fine with a simple sink (like a foreach), but if I put in a real sink made from a subscriber, I don't get anything.
My context is:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import org.reactivestreams.{Subscriber, Subscription}
implicit val system = ActorSystem.create("test")
implicit val materializer = ActorMaterializer.create(system)
class PrintSubscriber extends Subscriber[String] {
  override def onError(t: Throwable): Unit = {}
  override def onSubscribe(s: Subscription): Unit = {}
  override def onComplete(): Unit = {}
  override def onNext(t: String): Unit = {
    println(t)
  }
}
and my test case is:
val subscriber = new PrintSubscriber()
val sink = Sink.fromSubscriber(subscriber)
val source2 = Source.fromIterator(() => Iterator("aaa", "bbb", "ccc"))
val source1 = Source.fromIterator(() => Iterator("xxx", "yyy", "zzz"))
source1.to(sink).run()(materializer)
source2.runForeach(println)
I get output:
aaa
bbb
ccc
Why don't I get xxx, yyy, and zzz?
Citing the Reactive Streams specs for the Subscriber below:
Will receive call to onSubscribe(Subscription) once after passing an instance of Subscriber to Publisher.subscribe(Subscriber). No further notifications will be received until Subscription.request(long) is called.
The smallest change you can make to see some items flowing through to your subscriber is
override def onSubscribe(s: Subscription): Unit = {
  s.request(3)
}
However, keep in mind this won't make it fully compliant with the Reactive Streams spec. The fact that Subscribers are not easy to implement correctly is the main reason behind higher-level toolkits like Akka Streams itself.
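For illustration, here is a slightly fuller (but still not fully spec-compliant) sketch of a Subscriber that requests elements one at a time instead of a fixed batch of three:

import org.reactivestreams.{Subscriber, Subscription}

class PrintSubscriber extends Subscriber[String] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.request(1) // signal demand for the first element
  }

  override def onNext(t: String): Unit = {
    println(t)
    subscription.request(1) // request the next element once this one is handled
  }

  override def onError(t: Throwable): Unit = t.printStackTrace()

  override def onComplete(): Unit = println("completed")
}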
