I am trying to run the simple Flink job below to count words using the Table API. I used the DataStream API to read a stream of data and the StreamTableEnvironment API to create a table environment. I am getting the exception below. Can someone please tell me what is wrong in the code? I am using Flink 1.8.
Exception:
**Exception in thread "main" org.apache.flink.table.api.TableException: Only the first field can reference an atomic type.**
at org.apache.flink.table.api.TableEnvironment$$anonfun$5.apply(TableEnvironment.scala:1117)
at org.apache.flink.table.api.TableEnvironment$$anonfun$5.apply(TableEnvironment.scala:1112)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at org.apache.flink.table.api.TableEnvironment.getFieldInfo(TableEnvironment.scala:1112)
at org.apache.flink.table.api.StreamTableEnvironment.registerDataStreamInternal(StreamTableEnvironment.scala:546)
at org.apache.flink.table.api.java.StreamTableEnvironment.fromDataStream(StreamTableEnvironment.scala:91)
at Udemy_Course.TableAPIExample.CountWordExample.main(CountWordExample.java:30)
Code
public class CountWordExample {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment environment=StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment streamTableEnvironment=StreamTableEnvironment.create(environment);
DataStream<WC> streamOfWords =
environment.fromElements(
new WC("Hello",1L),
new WC("Howdy",1L),
new WC("Hello",1L),
new WC("Hello",1L));
Table t1 = streamTableEnvironment.fromDataStream(streamOfWords, "word, count");
Table result = streamTableEnvironment.sqlQuery("select word, count(word) as wordcount from " + t1 + " group by word");
streamTableEnvironment.toRetractStream(result, CountWordExample.class ).print();
environment.execute();
}
public static class WC {
private String word;
private Long count;
public WC() {}
public WC(String word,Long count) {
this.word=word;
this.count=count;
}
}
}
All that's needed to get this running is to change the private word and count fields in WC so that they are public, and to use Row.class rather than CountWordExample.class in toRetractStream.
Those fields either have to be public, or have publicly accessible getters and setters in order for the SQL engine to work with them.
public class CountWordExample {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment environment=StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment streamTableEnvironment=StreamTableEnvironment.create(environment);
DataStream<WC> streamOfWords =
environment.fromElements(
new WC("Hello",1L),
new WC("Howdy",1L),
new WC("Hello",1L),
new WC("Hello",1L));
Table t1 = streamTableEnvironment.fromDataStream(streamOfWords, "word, count");
Table result = streamTableEnvironment.sqlQuery("select word, count(word) as wordcount from " + t1 + " group by word");
streamTableEnvironment.toRetractStream(result, Row.class).print();
environment.execute();
}
public static class WC {
public String word;
public Long count;
public WC() {}
public WC(String word,Long count) {
this.word=word;
this.count=count;
}
}
}
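If you prefer to keep the fields private, an equivalent sketch is to expose public getters and setters on WC instead, which the Table API can also work with:
public static class WC {
    private String word;
    private Long count;

    public WC() {}

    public WC(String word, Long count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }
    public Long getCount() { return count; }
    public void setCount(Long count) { this.count = count; }
}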
Related
Earlier I asked about a simple hello world example for Flink, and that gave me some good examples!
However, I would like a more 'streaming' example, where we generate an input value every second. Ideally this would be random, but even the same value each time would be fine.
The objective is to get a stream that 'moves' with no or minimal external involvement.
Hence my question:
How can I show Flink actually streaming data without external dependencies?
I found how to do this by generating data externally and writing to Kafka, or by listening to a public source, but I am trying to solve it with minimal dependencies (similar to starting with GenerateFlowFile in NiFi).
Here's an example, constructed to show how to make your sources and sinks pluggable. The idea is that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public static class IncrementMapFunction implements MapFunction<Long, Long> {
@Override
public Long map(Long record) throws Exception {
return record + 1;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
@Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
@Override
public void cancel() {
cancelled = true;
}
}
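To illustrate the test side of this pluggable design, here is a minimal sketch of a test (assuming JUnit 4 is on the classpath; FixedLongSource and CollectingSink are hypothetical helper names) that injects a hardwired source and a collecting sink into the same job:
import static org.junit.Assert.assertTrue;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.junit.Test;

public class TestableStreamingJobTest {

    // Emits a fixed list of values and then finishes, so the job terminates on its own.
    private static class FixedLongSource implements SourceFunction<Long> {
        @Override
        public void run(SourceContext<Long> ctx) {
            for (long value : new long[] {1L, 10L, 100L}) {
                ctx.collect(value);
            }
        }

        @Override
        public void cancel() {}
    }

    // Collects everything it receives into a static list the test can inspect afterwards.
    private static class CollectingSink implements SinkFunction<Long> {
        static final List<Long> values = Collections.synchronizedList(new ArrayList<>());

        @Override
        public void invoke(Long value, Context context) {
            values.add(value);
        }
    }

    @Test
    public void incrementsEachValue() throws Exception {
        CollectingSink.values.clear();
        new TestableStreamingJob(new FixedLongSource(), new CollectingSink()).execute();
        assertTrue(CollectingSink.values.containsAll(Arrays.asList(2L, 11L, 101L)));
    }
}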
I have a collection that represents a data stream, and I am testing StreamingFileSink to write the stream to S3. The program runs successfully, but there is no data in the given S3 path.
public class S3Sink {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(100);
List<String> input = new ArrayList<>();
input.add("test");
DataStream<String> inputStream = see.fromCollection(input);
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("<S3 Path>"),
new SimpleStringEncoder<>("UTF-8"))
.withRollingPolicy(rollingPolicy)
.build();
inputStream.addSink(s3Sink);
see.execute();
}
}
Checkpointing is enabled as well. Any thoughts on why the sink is not working as expected?
UPDATE:
Based on David's answer, I created a custom source which continuously generates random strings, and I am expecting checkpointing to trigger after the configured interval and write the data to S3.
public class S3SinkCustom {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(1000);
DataStream<String> inputStream = see.addSource(new CustomSource());
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("s3://mybucket/data/"),
new SimpleStringEncoder<>("UTF-8"))
.build();
//inputStream.print();
inputStream.addSink(s3Sink);
see.execute();
}
static class CustomSource extends RichSourceFunction<String> {
private volatile boolean running = false;
final String[] strings = {"ABC", "XYZ", "DEF"};
@Override
public void open(Configuration parameters){
running = true;
}
@Override
public void run(SourceContext<String> sourceContext) throws Exception {
while (running) {
Random random = new Random();
int index = random.nextInt(strings.length);
sourceContext.collect(strings[index]);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
running = false;
}
}
}
Still, there is no data in S3, and the Flink process does not even validate whether the given S3 bucket exists, but the process runs without any issues.
Update:
Below are the custom rolling policy details:
public class CustomRollingPolicy implements RollingPolicy<Object, String> {
@Override
public boolean shouldRollOnCheckpoint(PartFileInfo partFileInfo) throws IOException {
return partFileInfo.getSize() > 1;
}
@Override
public boolean shouldRollOnEvent(PartFileInfo partFileInfo, Object o) throws IOException {
return true;
}
@Override
public boolean shouldRollOnProcessingTime(PartFileInfo partFileInfo, long l) throws IOException {
return true;
}
}
I believe the issue is that the job you've written isn't going to run long enough to actually checkpoint, so the output isn't going to be finalized.
Another potential issue is that the StreamingFileSink only works with the Hadoop-based S3 filesystem (and not the one from Presto).
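For the row format used here, a sketch of one alternative (assuming the job survives at least one checkpoint; the bucket path is a placeholder) is the built-in OnCheckpointRollingPolicy from org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies, which finalizes part files whenever a checkpoint completes. The relevant lines of S3SinkCustom.main would then look roughly like this:
// Parts are finalized when a checkpoint completes, so data becomes visible in S3
// shortly after each checkpoint (the job still has to survive at least one).
see.enableCheckpointing(1000);

StreamingFileSink<String> s3Sink = StreamingFileSink
        .forRowFormat(new Path("s3a://<bucket>/<prefix>"), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(OnCheckpointRollingPolicy.<String, String>build())
        .build();

inputStream.addSink(s3Sink);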
The above issue was resolved after setting up flink-conf.yaml with the required s3a properties, such as fs.s3a.access.key and fs.s3a.secret.key.
We also need to let Flink know about the config location:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));
With these changes, I was able to run the S3 sink locally, and messages persisted to S3 without any issues.
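For reference, a minimal sketch of that initialization (the S3Config class name and the config directory path are placeholders; flink-conf.yaml in that directory holds the fs.s3a.access.key and fs.s3a.secret.key entries mentioned above):
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;

public class S3Config {
    // Load flink-conf.yaml from an explicit directory and register it with Flink's
    // FileSystem machinery before the StreamingFileSink is built.
    public static void init() {
        Configuration conf = GlobalConfiguration.loadConfiguration("/path/to/flink/conf");
        FileSystem.initialize(conf);
    }
}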
I am trying to write a Flink Cassandra sink connector using the CassandraPojoSink class. I am not getting any error/exception, but no records are committed into the Cassandra table.
I am using the following code.
========= Sink connector code snapshot ==================
DataStream<Event> stream = eventStream.flatMap(new EventTransformation());
try {
stream.addSink(new CassandraPojoSink<>(Event.class, new ClusterBuilder() {
private static final long serialVersionUID = -2485105213096858846L;
@Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("localhost").withPort(9042).build();
}
}));
} catch (Exception e) {
e.printStackTrace();
}
====== POJO CLASS ================
@Table(keyspace = "cloud", name = "event")
public class Event implements Serializable {
private static final long serialVersionUID = 3284839826384795926L;
@Column(name = "name")
private String name;
@Column(name = "msg")
private String msg;
public Event(){
}
//......
}
There are many reasons why a Flink job might fail to produce any output. Some common reasons include:
the app doesn't call env.execute()
the app is set to use event time, but there's no watermark generator
the watermarking logic is confused somehow, and no watermarks are being generated (e.g., the app is generating watermarks based on the CPU clock rather than event timestamps, causing every event to be late)
After changing the POJO to a Tuple and adding timestamp/watermark assignment, the code works perfectly.
I am able to see my data written into the Cassandra database.
DataStream<Tuple3<String, String, Long>> events = event_stream.flatMap(new EventTransformation()).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Tuple3<String, String, Long>>() {
private static final long serialVersionUID = 1L;
private final long maxOutOfOrderness = 1_000L; // 1 second
private long currentMaxTimestamp = 0;
@Override
public long extractTimestamp(Tuple3<String, String, Long> arg0, long arg1) {
long timestamp = arg0.f2; // the event timestamp is the third tuple field
currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
return timestamp;
}
@Override
public Watermark getCurrentWatermark() {
return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
}
});
events.addSink(new CassandraTupleSink<Tuple3<String, String, Long>>("INSERT INTO cloud.condition (name, msg, time) VALUES (?,?,?);", new ClusterBuilder() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
protected Cluster buildCluster(Builder builder) {
return builder.addContactPoint("localhost").withPort(9042).build();
}
}));
env.setParallelism(2);
env.execute();
I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the run-time error message "The implementation of the MapFunction is not serializable any more. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm still troubleshooting and will post any updates.
public class RetrieveStockPrices {
@SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "stocks");
DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));
DataStream<String> stockPrice =
streamOfStockSymbols
//get unique keys
.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String trend) throws Exception {
return trend;
}
})
//collect events over a window
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
//return the last event from the window...all elements are the same "Symbol"
.apply(new WindowFunction<String, String, String, TimeWindow>() {
@Override
public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
out.collect(input.iterator().next().toString());
}
})
.map(new StockSymbolToPriceMapFunction());
streamExecEnv.execute("Retrieve Stock Prices");
}
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
@Override
public String map(String stockSymbol) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);
DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
return "100000";
}
private static class CustomKeySelector implements KeySelector<String, String> {
@Override
public String getKey(String arg0) throws Exception {
return arg0.trim();
}
}
}
public class LookupStockPrice extends RichSourceFunction<String> {
public String stockSymbol = null;
public boolean isRunning = true;
public LookupStockPrice(String inSymbol) {
stockSymbol = inSymbol;
}
@Override
public void open(Configuration parameters) throws Exception {
isRunning = true;
}
@Override
public void cancel() {
isRunning = false;
}
@Override
public void run(SourceFunction.SourceContext<String> ctx)
throws Exception {
String stockPrice = "0";
while (isRunning) {
//TODO: query Google Finance API
stockPrice = Integer.toString((new Random()).nextInt(100)+1);
ctx.collect(stockPrice);
Thread.sleep(10000);
}
}
}
A StreamExecutionEnvironment is not intended to be used inside the operators of a streaming application. "Not intended" means this is neither tested nor encouraged: it might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program specifies, for each incoming record, a completely new and independent streaming application. However, since you never call streamExecEnv.execute(), these programs are not started, and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on that local cluster. Each local Flink instance takes a lot of heap space, and after a few clusters have been started, the worker would probably die with an OutOfMemoryError, which is not what you want.
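What usually works instead (a sketch, not a drop-in fix; PriceClient is a hypothetical stand-in for whatever pricing API you actually call) is to perform the lookup inside the map function itself and attach the Kafka producer once, in the main program:
import java.util.Random;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Revised sketch of StockSymbolToPriceMapFunction: look up the price directly in map()
// instead of building a new StreamExecutionEnvironment per record.
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {

    private transient PriceClient priceClient;

    @Override
    public void open(Configuration parameters) {
        // Create non-serializable resources (HTTP clients, connections) here, not in the constructor.
        priceClient = new PriceClient();
    }

    @Override
    public String map(String stockSymbol) throws Exception {
        return stockSymbol + "," + priceClient.lookupPrice(stockSymbol);
    }

    // Hypothetical client; replace with a real lookup against your pricing service.
    static class PriceClient {
        private final Random random = new Random();

        String lookupPrice(String symbol) {
            return Integer.toString(random.nextInt(100) + 1);
        }
    }
}
The main program would then end the chain with something like .map(new StockSymbolToPriceMapFunction()).addSink(new FlinkKafkaProducer08<>("localhost:9092", "stockprices", new SimpleStringSchema())), so the sink is attached exactly once.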
I am running a MapReduce job over 50 million User records.
For each user I read two other Datastore entities and then stream stats for each player to BigQuery.
My first dry run (with streaming to BigQuery disabled) failed with the following stack trace.
/_ah/pipeline/handleTask
com.google.appengine.tools.cloudstorage.NonRetriableException: com.google.apphosting.api.ApiProxy$RequestTooLargeException: The request to API call datastore_v3.Put() was too large.
at com.google.appengine.tools.cloudstorage.RetryHelper.doRetry(RetryHelper.java:121)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:166)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:157)
at com.google.appengine.tools.pipeline.impl.backend.AppEngineBackEnd.tryFiveTimes(AppEngineBackEnd.java:196)
at com.google.appengine.tools.pipeline.impl.backend.AppEngineBackEnd.saveWithJobStateCheck(AppEngineBackEnd.java:236)
I have googled this error, and the only thing I found suggests that the Mapper is too big to be serialized, but our Mapper holds hardly any data at all.
/**
* Adds stats for a player via streaming api.
*/
public class PlayerStatsMapper extends Mapper<Entity, Void, Void> {
private static Logger log = Logger.getLogger(PlayerStatsMapper.class.getName());
private static final long serialVersionUID = 1L;
private String dataset;
private String table;
private transient GbqUtils gbq;
public PlayerStatsMapper(String dataset, String table) {
gbq = Davinci.getComponent(GbqUtils.class);
this.dataset = dataset;
this.table = table;
}
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
in.defaultReadObject();
log.info("IOC reinitating due to deserialization.");
gbq = Davinci.getComponent(GbqUtils.class);
}
@Override
public void beginShard() {
}
@Override
public void endShard() {
}
@Override
public void map(Entity value) {
if (!value.getKind().equals("User")) {
log.severe("Expected a User but got a " + value.getKind());
return;
}
User user = new User(1, value);
List<Map<String, Object>> rows = new LinkedList<Map<String, Object>>();
List<PlayerStats> playerStats = readPlayerStats(user.getUserId());
addRankings(user.getUserId(), playerStats);
for (PlayerStats ps : playerStats) {
rows.add(ps.asMap());
}
// if (rows.size() > 0)
// gbq.insert(dataset, table, rows);
}
.... private methods only
}
The MapReduce job is started with this code:
MapReduceSettings settings = new MapReduceSettings().setWorkerQueueName("mrworker");
settings.setBucketName(gae.getAppName() + "-playerstats");
// @formatter:off <I, K, V, O, R>
MapReduceSpecification<Entity, Void, Void, Void, Void> spec =
MapReduceSpecification.of("Enque player stats",
new DatastoreInput("User", shardCountMappers),
new PlayerStatsMapper(dataset, "playerstats"),
Marshallers.getVoidMarshaller(),
Marshallers.getVoidMarshaller(),
NoReducer.<Void, Void, Void> create(),
NoOutput.<Void, Void> create(1));
// @formatter:on
String jobId = MapReduceJob.start(spec, settings);
Well, I solved this by going back to appengine-mapreduce-0.2.jar, which was the one we had used before. The one used above was appengine-mapreduce-0.5.jar, which turned out not to work for us.
After going back to 0.2, the console _ah/pipeline/list started to work again as well!
Has anyone else encountered a similar problem with 0.5?