Why is Table Schema inference from Scala case classes not working in this official example - apache-flink

I am having trouble with schema inference from Scala case classes during conversion from DataStreams to Tables in Flink. I've tried reproducing the examples given in the documentation but cannot get them to work. I'm wondering whether this might be a bug?
I have commented on a somewhat related issue in the past. My workaround is to avoid case classes and instead, somewhat laboriously, define a DataStream[Row] with explicit return type annotations.
Still, I would like to know whether it is possible to get schema inference from case classes working.
I'm using Flink 1.15.2 with Scala 2.12.7. I'm using the Java libraries but add flink-scala as a separate dependency.
This is my implementation of Example 1 as a quick sanity check:
import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration
import org.apache.flink.test.util.MiniClusterWithClientResource
import org.scalatest.BeforeAndAfter
import org.scalatest.funsuite.AnyFunSuite
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import java.time.Instant
class SanitySuite extends AnyFunSuite with BeforeAndAfter {
  val flinkCluster = new MiniClusterWithClientResource(
    new MiniClusterResourceConfiguration.Builder()
      .setNumberSlotsPerTaskManager(2)
      .setNumberTaskManagers(1)
      .build
  )

  before {
    flinkCluster.before()
  }

  after {
    flinkCluster.after()
  }

  test("Verify that table conversion works as expected") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(env)

    case class User(name: String, score: java.lang.Integer, event_time: java.time.Instant)

    // create a DataStream
    val dataStream = env.fromElements(
      User("Alice", 4, Instant.ofEpochMilli(1000)),
      User("Bob", 6, Instant.ofEpochMilli(1001)),
      User("Alice", 10, Instant.ofEpochMilli(1002))
    )

    val table = tableEnv.fromDataStream(dataStream)
    table.printSchema()
  }
}
According to the documentation, this should result in:
(
`name` STRING,
`score` INT,
`event_time` TIMESTAMP_LTZ(9)
)
What I get:
(
`f0` RAW('SanitySuite$User$1', '...')
)
If I instead modify my code in line with Example 5, that is, explicitly define a Schema that mirrors the case class, I instead get an error which very much looks like it results from an inability to extract the case class fields:
Unable to find a field named 'event_time' in the physical data type derived from the given type information for schema declaration. Make sure that the type information is not a generic raw type. Currently available fields are: [f0]
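For reference, the Example 5-style modification looks roughly like this (a sketch only; the explicit columns are just my attempt to mirror the case class, and Schema comes from org.apache.flink.table.api):

// requires: import org.apache.flink.table.api.Schema
val table = tableEnv.fromDataStream(
  dataStream,
  Schema.newBuilder()
    .column("name", "STRING")
    .column("score", "INT")
    .column("event_time", "TIMESTAMP_LTZ(9)")
    .build()
)
table.printSchema()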

The issue is with the imports: you are importing the Java classes while using Scala case classes as POJOs.
Using the following imports works:
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.configuration.Configuration
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
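Applied to the test in the question, a minimal sketch would be the following (an assumption on my part: the User case class is moved to the top level of the file; otherwise only the environment and table-bridge imports are swapped):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import java.time.Instant

// top-level case class, so it is a plain Scala case class without an outer reference
case class User(name: String, score: java.lang.Integer, event_time: Instant)

object SanityCheck {
  def main(args: Array[String]): Unit = {
    // Scala StreamExecutionEnvironment plus the Scala table bridge, so the case class
    // goes through Scala type extraction instead of falling back to a RAW type
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(env)

    val dataStream = env.fromElements(
      User("Alice", 4, Instant.ofEpochMilli(1000)),
      User("Bob", 6, Instant.ofEpochMilli(1001)),
      User("Alice", 10, Instant.ofEpochMilli(1002))
    )

    // expected to print name STRING, score INT, event_time TIMESTAMP_LTZ(9)
    tableEnv.fromDataStream(dataStream).printSchema()
  }
}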

Related

PyFlink - Scala UDF - How to convert Scala Map in Table API?

I'm trying to map the Map[String,String] output of my Scala UDF (scala.collection.immutable.Map) to a valid data type in the Table API, namely via the Java type (java.util.Map), as recommended here: Flink Table API & SQL and map types (Scala). However, I get the error below.
Any idea about the right way to proceed? If so, is there a way to generalize the conversion to a (nested) Scala object of type Map[String, Any]?
Code
Scala UDF
class dummyMap() extends ScalarFunction {
  def eval() = {
    val whatevermap = Map("key1" -> "val1", "key2" -> "val2")
    whatevermap.asInstanceOf[java.util.Map[java.lang.String, java.lang.String]]
  }
}
Sink
my_sink_ddl = f"""
create table mySink (
output_of_dummyMap_udf MAP<STRING,STRING>
) with (
...
)
"""
Error
Py4JJavaError: An error occurred while calling o430.execute.
: org.apache.flink.table.api.ValidationException: Field types of query result and registered TableSink `default_catalog`.`default_database`.`mySink` do not match.
Query result schema: [output_of_my_scala_udf: GenericType<java.util.Map>]
TableSink schema: [output_of_my_scala_udf: Map<String, String>]
Thanks!
Original answer from Wei Zhong; I'm just the reporter. Thanks, Wei!
At this point (Flink 1.11), two methods are working:
Current: DataTypeHint in UDF definition + SQL for UDF registering
Outdated: override getResultType in UDF definition + t_env.register_java_function for UDF registering
Code
Scala UDF
package com.dummy
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.table.annotation.DataTypeHint
import org.apache.flink.table.api.Types
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row
class dummyMap extends ScalarFunction {

  // If the UDF is registered via a SQL statement, you need to add this type hint
  @DataTypeHint("ROW<s STRING,t STRING>")
  def eval(): Row = {
    Row.of(java.lang.String.valueOf("foo"), java.lang.String.valueOf("bar"))
  }

  // If the UDF is registered via the method 'register_java_function',
  // you need to override this method instead.
  override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = {
    // The return type should be described as TypeInformation
    Types.ROW(Array("s", "t"), Array[TypeInformation[_]](Types.STRING(), Types.STRING()))
  }
}
Python code
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
s_env = StreamExecutionEnvironment.get_execution_environment()
st_env = StreamTableEnvironment.create(s_env)
# load the scala udf jar file, the path should be modified to yours
# or you can also load the jar file via other approaches
st_env.get_config().get_configuration().set_string("pipeline.jars", "file:///Users/zhongwei/the-dummy-udf.jar")
# register the udf via SQL DDL
st_env.execute_sql("CREATE FUNCTION dummyMap AS 'com.dummy.dummyMap' LANGUAGE SCALA")
# or register via the method
# st_env.register_java_function("dummyMap", "com.dummy.dummyMap")
# prepare source and sink
t = st_env.from_elements([(1, 'hi', 'hello'), (2, 'hi', 'hello')], ['a', 'b', 'c'])
st_env.execute_sql("""create table mySink (
output_of_my_scala_udf ROW<s STRING,t STRING>
) with (
'connector' = 'print'
)""")
# execute query
t.select("dummyMap()").execute_insert("mySink").get_job_client().get_job_execution_result().result()

Importing namespace and interface with methods in TypeScript (TSX) in pdfjs-dist (PDFJS)

I am trying to use pdfjs-dist in my React project, but I'm running into a lot of problems importing the module and its functions.
The pdfjs-dist module's index.d.ts under @types in node_modules is defined so that it contains a namespace "PDF" and a module "pdfjs-dist" which exports "PDF".
The file has interfaces containing methods such as "getDocument(name: string)" which I want to call from my other classes.
In short, the file consists of a lot of interfaces and methods implemented through this interface, of the form:
declare module "pdfjs-dist" {
export = PDF;
}
declare namespace PDF {
interface PDFJSStatic {
getDocument(
source: string,
pdfDataRangeTransport ? : any,
passwordCallback ? : (fn: (password: string) => void, reason: string) => string,
progressCallback ? : (progressData: PDFProgressData) => void): PDFPromise < PDFDocumentProxy > ;
}
I have tried to use the regular import statements, such as:
import * as PDF from "pdfjs-dist"
and
import { PDFJSStatic } from "pdfjs-dist"
However, neither seems to work very well. VS Code shows me all the interfaces, so I can see what they are, but this is where my knowledge of React and TypeScript falls a bit short.
How would I go about calling the methods and actually using the "getDocument()" method?
For some reason the fix seems to be to import the interface first, so that PDFJSStatic and the other interfaces are available when the require statement on the second line runs.
The import statements that I used were:
import { PDFJSStatic, PDFPageProxy } from "pdfjs-dist";
let PDFJS: PDFJSStatic = require("pdfjs-dist");
This is probably not the correct way of doing it, but it works.

Scala Slick-Extensions SQLServerDriver 2.1.0 usage - can't get it to compile

I am trying to use Slick-Extensions to connect to a SQL Server database from Scala. I use Slick 2.1.0 and slick-extensions 2.1.0.
I can't seem to get the code I wrote to compile. I followed the examples from Slick's website, and they compiled fine when the driver was H2. Please see below:
package com.example
import com.typesafe.slick.driver.ms.SQLServerDriver.simple._
import scala.slick.direct.AnnotationMapper.column
import scala.slick.lifted.TableQuery
import scala.slick.model.Table
class DestinationMappingsTable(tag: Tag) extends Table[(Long, Int, Int)](tag, "DestinationMappings_tbl") {
  def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
  def mltDestinationType = column[Int]("mltDestinationType")
  def mltDestinationId = column[Int]("mltDestinationId")
  def * = (id, mltDestinationType, mltDestinationId)
}
I am getting a wide range of errors: scala.slick.model.Table does not take type parameters, column does not take type parameters, and O is not found.
If the SQLServerDriver does not use the same syntax as Slick, where do I find its documentation?
Thank you!
I think your import of scala.slick.model.Table shadows the Table that comes from com.typesafe.slick.driver.ms.SQLServerDriver.simple._
Try simply removing:
import scala.slick.model.Table
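A sketch of the resulting table definition, assuming (beyond what the answer above states) that the unused scala.slick.direct and scala.slick.lifted imports can also be dropped, since the driver's simple._ already provides Tag, Table, column and O:

package com.example

// the driver's simple._ brings Tag, Table, column and O into scope
import com.typesafe.slick.driver.ms.SQLServerDriver.simple._

class DestinationMappingsTable(tag: Tag) extends Table[(Long, Int, Int)](tag, "DestinationMappings_tbl") {
  def id = column[Long]("id", O.PrimaryKey, O.AutoInc)
  def mltDestinationType = column[Int]("mltDestinationType")
  def mltDestinationId = column[Int]("mltDestinationId")
  def * = (id, mltDestinationType, mltDestinationId)
}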

Pass enum to ndb.Model field in python

I found How can I represent an 'Enum' in Python? for how to create an enum in Python. I have a field in my ndb.Model that I want to accept one of my enum values. Do I simply set the field to StringProperty? My enum is:
def enum(**enums):
    return type('Enum', (), enums)

ALPHA = enum(A="A", B="B", C="C", D="D")
This is fully supported in the ProtoRPC Python API and it's not worth rolling your own.
A simple Enum would look like the following:
from protorpc import messages
class Alpha(messages.Enum):
    A = 0
    B = 1
    C = 2
    D = 3
As it turns out, ndb has the msgprop module for storing protorpc objects, and this is documented.
So to store your Alpha enum, you'd do the following:
from google.appengine.ext import ndb
from google.appengine.ext.ndb import msgprop
class Part(ndb.Model):
    alpha = msgprop.EnumProperty(Alpha, required=True)
    ...
EDIT: As pointed out by hadware, a msgprop.EnumProperty is not indexed by default. If you want to perform queries over such properties, you need to define the property as
alpha = msgprop.EnumProperty(Alpha, required=True, indexed=True)
and then perform queries such as
Part.query(Part.alpha == Alpha.B)
or with any other Alpha value in place of Alpha.B.

different development/production databases in scalaquery

ScalaQuery requires (AFAIK) a provider-specific import in your code, for example:
import org.scalaquery.ql.extended.H2Driver.Implicit._
We are trying to use H2 in development mode and MySQL in production. Is there a way to achieve this?
My approach was:
class Subscribers(database: Database)(profile: ExtendedProfile) {
  import profile.Implicit._
}
Where Subscribers is basically my data access object.
Not sure this is the best approach out there, but it solved my case.
You would create such a DAO like this:
...in production code:
new Subscribers(database)(MySQLDriver)
...and in test code:
new Subscribers(database)(H2Driver)
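Putting the pieces together, a minimal sketch of the pattern (the package paths for Database and ExtendedProfile are my assumptions; the driver paths match the imports used elsewhere in this thread):

import org.scalaquery.session.Database
import org.scalaquery.ql.extended.{ExtendedProfile, H2Driver, MySQLDriver}

// the DAO depends only on the abstract ExtendedProfile;
// queries written inside it compile against whichever driver is injected
class Subscribers(database: Database)(profile: ExtendedProfile) {
  import profile.Implicit._
  // ... table definitions and queries go here ...
}

// production wiring
// val subscribers = new Subscribers(db)(MySQLDriver)
// test/development wiring
// val subscribers = new Subscribers(db)(H2Driver)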
I use the following in Play Framework:
object test {
  lazy val extendedProfile = {
    val extendedProfileName = Play.configuration getString "db.default.extendedProfile" get
    companionObjectNamed(extendedProfileName).asInstanceOf[ExtendedProfile]
  }

  def companionObjectNamed(name: String): AnyRef = {
    val c = Class forName (name + "$")
    c.getField("MODULE$") get c
  }
}
And then import:
import util.extendedProfile.Implicit._
org.scalaquery.ql.extended.MySQLDriver is the string I used in the config to make MySQL work.
