I was trying to use the Table API inside a flatMap by passing the Flink environment object to the flatMap function, but I was getting a serialization exception saying that I have added a field which is not serializable.
Could you please shed some light on this?
Regards,
Sajeev
You cannot pass the ExecutionEnvironment into a Function. It would be like passing Flink into Flink.
The Table API is an abstraction on top of the DataSet/DataStream APIs. If you want to use both the Table API and the lower-level APIs, you can use TableEnvironment#toDataSet/fromDataSet to switch between the APIs, even in the middle of a chain of DataSet operators:
DataSet<Integer> ds = env.fromElements(1, 2, 3);
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
Table t = tEnv.fromDataSet(ds, "intCol"); // continue in Table API
Table t2 = t.select("intCol.cast(STRING)"); // do something with table
DataSet<String> ds2 = tEnv.toDataSet(t2); // continue in DataSet API
How to read the JSON file format in Apache Flink using Java?
I am not able to find any proper code to read a JSON file in Flink using Java and do some transformations on top of it.
Any suggestions or code are highly appreciated.
For using Kafka with the DataStream API, see https://stackoverflow.com/a/62072265/2000823. The idea is to implement an appropriate DeserializationSchema, or KafkaDeserializationSchema. There's an example (and pointers to more) in the answer I've linked to above.
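As a rough sketch (not taken from the linked answer), a Jackson-based schema could look like this; the Event class is a hypothetical POJO whose fields match the JSON messages:

import java.io.IOException;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical Event POJO assumed; only deserialize() needs to be implemented,
// since AbstractDeserializationSchema supplies the produced type information.
public class EventDeserializationSchema extends AbstractDeserializationSchema<Event> {

    // static, so the non-serializable ObjectMapper is not shipped with the schema instance
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public Event deserialize(byte[] message) throws IOException {
        return MAPPER.readValue(message, Event.class);
    }
}

An instance of this schema is then passed to the Kafka consumer together with the topic name and consumer properties.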
Or if you want to use the Table API or SQL, it's easier. You can configure this with a bit of DDL. For example:
CREATE TABLE minute_stats (
`minute` TIMESTAMP(3),
`currency` STRING,
`revenueSum` DOUBLE,
`orderCnt` BIGINT,
WATERMARK FOR `minute` AS `minute` - INTERVAL '10' SECOND
) WITH (
'connector.type' = 'kafka',
'connector.version' = 'universal',
'connector.topic' = 'minute_stats',
'connector.properties.zookeeper.connect' = 'not-needed',
'connector.properties.bootstrap.servers' = 'kafka:9092',
'connector.startup-mode' = 'earliest-offset',
'format.type' = 'json'
);
For trying things out locally while reading from a file, you'll need to do things differently. Something like this:
DataStreamSource<String> rawInput = env.readFile(
new TextInputFormat(new Path(fileLocation)), fileLocation);
DataStream<Event> events = rawInput.flatMap(new MyJSONTransformer());
where MyJSONTransformer might use a Jackson ObjectMapper to convert JSON into some convenient Event type (a POJO).
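A minimal sketch of such a transformer, again assuming a hypothetical Event POJO, might look like this; malformed lines are simply dropped, which is one reason to prefer flatMap over map here:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: Event is a hypothetical POJO matching the JSON lines in the file.
public class MyJSONTransformer implements FlatMapFunction<String, Event> {

    // static keeps the non-serializable ObjectMapper out of the function's serialized state
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public void flatMap(String line, Collector<Event> out) {
        try {
            out.collect(MAPPER.readValue(line, Event.class));
        } catch (Exception e) {
            // ignore (or log) lines that are not valid JSON
        }
    }
}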
I'm using Gatling with the JDBC feeder and would like to dynamically add a parameter to the JDBC feeder's where clause based on the response from a previous request. Here is my example: I'm trying to do a POST that creates a user, then have the feeder grab the user's generated UUID using the userId returned from the create-user request, and then post some data with that UUID.
val dbConnectionString = "jdbc:mysql://localhost:3306/user"
val sqlQuery = "SELECT user_uuid FROM users where user_id = '${userId}'"
val sqlUserName = "dbUser"
val sqlPassword = "dbPassword"
val sqlQueryFeeder = jdbcFeeder(dbConnectionString, sqlUserName, sqlPassword, sqlQuery)
val uuidPayload = """{"userUUID":"${user_uuid}"}"""
val MyScenario = scenario("MyScenario").exec(
(pause(1, 2))
.exec(http("SubmitFormThatCreatesUserData")
.post(USER_CREATE_URL)
.body(StringBody("""{"username":"test#test.com"}""")).asJson
.header("Accept", "application/json")
.check(status.is(200))
.check(jsonPath("$..data.userId").exists.saveAs("userId")))
.feed(sqlQueryFeeder)
.exec(http("SubmitStuffWithUUID")
.post(myUUIDPostURL)
.body(uuidPayload).asJson
.header("Accept", "application/json")
.check(status.is(200)))
)
I have verified the following:
1) The user data does get inserted into the DB correctly on the form post
2) The userId is returned from that form post
3) The userId is correctly saved as a Gatling session variable
4) The SQL query executes correctly if I hard-code the userId variable
The problem is that when I use the Gatling ${userId} parameter in the JDBC feeder's where clause, the userId variable doesn't seem to be applied; I get an error saying java.lang.IllegalArgumentException: requirement failed: Feeder must not be empty. When I replace ${userId} with a hard-coded userId, everything works as expected. I would just like to know how I can use the userId session parameter in my JDBC feeder's where clause.
The jdbcFeeder(dbConnectionString, sqlUserName, sqlPassword, sqlQuery) call that creates a JDBC feeder only takes plain strings as parameters, not Gatling Expression Language strings (like ${userId}).
In your scenario as posted, you are not really using feeders as intended - they are generally used to have different users pick up different values from a pool of values, whereas you have a static username and are only taking the first value returned from the DB. It's also generally not a good idea to fetch external data in the middle of your scenarios, as it can make timings unpredictable.
Could you just look up and hardcode the user_uuid? The best approach would be to get all your user data and look up things like UUIDs in the before block of the simulation. However, you can't use the Gatling DSL there.
You could also use a Scala variable to store the user_uuid and define a feeder inline, but this would get messy if you do need to support multiple users.
I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream, I want to aggregate events by a composite key and an event-time tumbling window, and then write the result to a table.
The problem is that after applying my AggregateFunction, which just counts the number of clicks per clientId, I can't find a way to get the key of each output record, since the API returns an instance of the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);
stream.keyBy(new KeySelector<Event, Integer>() {
        @Override
        public Integer getKey(Event event) { return event.getClientId(); }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not inject the key of the input events into the accumulator, as I felt that wouldn't be nice.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
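For illustration, here is a minimal sketch of such a ProcessWindowFunction, assuming MyAggregateFunction emits a Long click count and the key is the Integer clientId from your KeySelector:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch: emits (clientId, count) pairs; adjust the types to your actual aggregate output.
public class MyProcessWindowFunction
        extends ProcessWindowFunction<Long, Tuple2<Integer, Long>, Integer, TimeWindow> {

    @Override
    public void process(Integer clientId,
                        Context context,
                        Iterable<Long> aggregates,
                        Collector<Tuple2<Integer, Long>> out) {
        // With incremental aggregation the iterable contains exactly one element:
        // the pre-aggregated result for this key and window.
        Long count = aggregates.iterator().next();
        out.collect(Tuple2.of(clientId, count));
    }
}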
I am dealing with a stream of database mutations, i.e., a change log stream. I want to be able to transform the values using a SQL query.
I am having difficulty putting together the following three concepts: RowTypeInfo, Row, and DataStream.
NOTE: I don't know the schema beforehand. I construct it on the fly using the data within the Mutation object (Mutation is a custom type).
More specifically I have code that looks like this.
val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(execEnv)
// Mutation is a custom type
val mutationStream: DataStream[Mutation] = ...
// toRows returns an object of type org.apache.flink.types.Row
val rowStream:DataStream[Row] = mutationStream.flatMap({mutation => toRows(mutation)})
tableEnv.registerDataStream("spinal_tap_table", rowStream)
tableEnv.sql("select col1 + 2")
NOTE: the Row object is positional and doesn't have a placeholder for column names.
I couldn't find a place to attach the schema to the DataStream object.
I want to pass some sort of a struct similar to Row that contains the complete information {columnName: String, columnValue: Object, columnType: TypeInformation[_]} for the query.
In Flink SQL, a table schema is mandatory when the Table is defined. It is not possible to run queries on dynamically typed records.
Regarding the concepts of RowTypeInfo, Row and DataStream:
Row is the actual record that holds the data
RowTypeInfo is a schema description for Rows. It contains names and TypeInformation for each field of a Row.
DataStream is a logical stream of records. A DataStream[Row] is a stream of rows. Note that this is not the actual stream itself but just the API concept that represents a stream.
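So the schema has to be fixed before the stream is registered with the table environment. As a rough sketch in Java (the Scala version is analogous), and assuming the column names and types are known up front and that toRows returns an Iterable of Row, you could build a RowTypeInfo and attach it to the flatMap output via returns(...):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.types.Row;
import org.apache.flink.util.Collector;

// Sketch: the column names and types below are placeholders.
RowTypeInfo rowType = new RowTypeInfo(
        new TypeInformation<?>[] { Types.STRING, Types.LONG },
        new String[] { "col1", "col2" });

DataStream<Row> rowStream = mutationStream
        .flatMap(new FlatMapFunction<Mutation, Row>() {
            @Override
            public void flatMap(Mutation mutation, Collector<Row> out) {
                for (Row row : toRows(mutation)) {   // toRows as in your snippet
                    out.collect(row);
                }
            }
        })
        .returns(rowType);  // gives the positional Rows their field names and types

tableEnv.registerDataStream("spinal_tap_table", rowStream);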
I was surprised to find that there are no outer joins for DataStream in Flink (DataStream docs).
For DataSet you have all the options: leftOuterJoin, rightOuterJoin and fullOuterJoin, apart from the regular join (DataSet docs). But for DataStream you just have the plain old join.
Is this due to some fundamental properties of the DataStream that make it impossible to have outer joins? Or maybe we can expect this in the (close?) future?
I could really use an outer join on DataStream for the problem I'm working on... Is there any way to achieve a similar behaviour?
You can implement outer joins using the DataStream.coGroup() transformation. A CoGroupFunction receives two iterators (one for each input), which serve all elements of a certain key and which may be empty if no matching element is found. This allows you to implement outer join functionality.
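For example, a left outer join within a one-minute event-time window could be sketched as follows; the orders and payments streams, the Order and Payment types, the getOrderId() key, and the Payment.MISSING sentinel are all made-up placeholders:

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// Sketch: join orders with payments on an order id, keeping unmatched orders.
DataStream<Tuple2<Order, Payment>> joined = orders
    .coGroup(payments)
    .where(order -> order.getOrderId())
    .equalTo(payment -> payment.getOrderId())
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .apply(new CoGroupFunction<Order, Payment, Tuple2<Order, Payment>>() {
        @Override
        public void coGroup(Iterable<Order> left,
                            Iterable<Payment> right,
                            Collector<Tuple2<Order, Payment>> out) {
            for (Order order : left) {
                boolean matched = false;
                for (Payment payment : right) {
                    matched = true;
                    out.collect(Tuple2.of(order, payment));
                }
                if (!matched) {
                    // left outer part: emit the unmatched order with a sentinel value,
                    // since Flink's Tuple serializer does not accept null fields
                    out.collect(Tuple2.of(order, Payment.MISSING));
                }
            }
        }
    });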
First-class support for outer joins might be added to the DataStream API in one of the next releases of Flink. I am not aware of any such efforts at the moment. However, creating an issue in the Apache Flink JIRA could help.
One way would be to go from a stream -> table -> stream, using the following API: FLINK TABLE API - OUTER JOIN
Here is a Java example:
DataStream<String> data = env.readTextFile( ... );
DataStream<String> data2Merge = env.readTextFile( ... );
...
tableEnv.registerDataStream("myDataLeft", data, "left_column1, left_column2");
tableEnv.registerDataStream("myDataRight", data2Merge, "right_column1, right_column2");
String queryLeft = "SELECT left_column1, left_column2 FROM myDataLeft";
String queryRight = "SELECT right_column1, right_column2 FROM myDataRight";
Table tableLeft = tableEnv.sqlQuery(queryLeft);
Table tableRight = tableEnv.sqlQuery(queryRight);
Table fullOuterResult = tableLeft.fullOuterJoin(tableRight, "left_column1 == right_column1").select("left_column1, left_column2, right_column2");
DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(fullOuterResult, Row.class);