mapreduce fails with message "The request to API call datastore_v3.Put() was too large." - google-app-engine

I am running a mapreduce job over 50 million User records.
For each user I read two other Datastore entities and then stream the stats for each player to BigQuery.
My first dry run (with streaming to BigQuery disabled) failed with the following stack trace:
/_ah/pipeline/handleTask
com.google.appengine.tools.cloudstorage.NonRetriableException: com.google.apphosting.api.ApiProxy$RequestTooLargeException: The request to API call datastore_v3.Put() was too large.
at com.google.appengine.tools.cloudstorage.RetryHelper.doRetry(RetryHelper.java:121)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:166)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:157)
at com.google.appengine.tools.pipeline.impl.backend.AppEngineBackEnd.tryFiveTimes(AppEngineBackEnd.java:196)
at com.google.appengine.tools.pipeline.impl.backend.AppEngineBackEnd.saveWithJobStateCheck(AppEngineBackEnd.java:236)
I have googled this error, and the only explanation I can find is that the Mapper is too big to be serialized, but our Mapper has no data at all.
/**
* Adds stats for a player via streaming api.
*/
public class PlayerStatsMapper extends Mapper<Entity, Void, Void> {
private static Logger log = Logger.getLogger(PlayerStatsMapper.class.getName());
private static final long serialVersionUID = 1L;
private String dataset;
private String table;
private transient GbqUtils gbq;
public PlayerStatsMapper(String dataset, String table) {
gbq = Davinci.getComponent(GbqUtils.class);
this.dataset = dataset;
this.table = table;
}
private void readObject(java.io.ObjectInputStream in) throws IOException, ClassNotFoundException {
in.defaultReadObject();
log.info("IOC reinitating due to deserialization.");
gbq = Davinci.getComponent(GbqUtils.class);
}
@Override
public void beginShard() {
}
@Override
public void endShard() {
}
@Override
public void map(Entity value) {
if (!value.getKind().equals("User")) {
log.severe("Expected a User but got a " + value.getKind());
return;
}
User user = new User(1, value);
List<Map<String, Object>> rows = new LinkedList<Map<String, Object>>();
List<PlayerStats> playerStats = readPlayerStats(user.getUserId());
addRankings(user.getUserId(), playerStats);
for (PlayerStats ps : playerStats) {
rows.add(ps.asMap());
}
// if (rows.size() > 0)
// gbq.insert(dataset, table, rows);
}
.... private methods only
}
The mapreduce job is started with this code:
MapReduceSettings settings = new MapReduceSettings().setWorkerQueueName("mrworker");
settings.setBucketName(gae.getAppName() + "-playerstats");
// @formatter:off <I, K, V, O, R>
MapReduceSpecification<Entity, Void, Void, Void, Void> spec =
MapReduceSpecification.of("Enque player stats",
new DatastoreInput("User", shardCountMappers),
new PlayerStatsMapper(dataset, "playerstats"),
Marshallers.getVoidMarshaller(),
Marshallers.getVoidMarshaller(),
NoReducer.<Void, Void, Void> create(),
NoOutput.<Void, Void> create(1));
// @formatter:on
String jobId = MapReduceJob.start(spec, settings);

I solved this by reverting to appengine-mapreduce-0.2.jar, which is the version we had used before. The version used above was appengine-mapreduce-0.5.jar, which turned out not to work for us.
After reverting to 0.2, the console at _ah/pipeline/list started to work again as well!
Has anyone else encountered a similar problem with 0.5?
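Since the stack trace fails while the pipeline saves job state, one quick sanity check on the "Mapper is too big" theory is to measure how large an object is once Java-serialized. The sketch below is plain Java with no App Engine dependencies. SmallMapper, SerializedSize, and the bigHelper field are hypothetical stand-ins, not the real PlayerStatsMapper; the sketch only illustrates that transient fields are left out of the default serialized form.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a mapper: two small strings plus a large transient helper.
class SmallMapper implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String dataset;
    private final String table;
    // transient: skipped by default Java serialization, like the gbq field in the question
    private transient byte[] bigHelper = new byte[1_000_000];

    SmallMapper(String dataset, String table) {
        this.dataset = dataset;
        this.table = table;
    }
}

class SerializedSize {
    // Returns the number of bytes produced by default Java serialization.
    static int of(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("mapper bytes: " + of(new SmallMapper("stats", "playerstats")));
        System.out.println("1 MB array bytes: " + of(new byte[1_000_000]));
    }
}
```

If the real mapper measures only a few hundred bytes this way, the oversized Put is more likely caused by something else the framework stores, which would be consistent with the version change making the difference.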

Related

How to check if a MapState is empty in flink 1.8

I have an application that, on first start, reads all the data from a DB and adds it to a MapState. Here is my RichCoFlatMapFunction:
private transient MapState<String, Record> mapState;
@Override
public void open(Configuration parameters) throws Exception {
mapState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Record>("recordState",
TypeInformation.of(new TypeHint<String>(){}), TypeInformation.of(new TypeHint<Record>() {})));
}
@Override
public void flatMap1(Record record, Collector<OutputRecord> collector) throws Exception {
readForFirstTime();
mapState.put(record.getId(), record);
}
@Override
public void flatMap2(Item item, Collector<OutputRecord> collector) throws Exception {
readForFirstTime();
Record record = mapState.get(item.getId());
System.out.println("Item arrived at time:"+ item.getTimestamp() +
". Record at the exact same time:" + record.toString());
}
private void readForFirstTime() {
// I need a mechanism here to detect if recordState is empty
// then only listAllFromDB
for(Record record: listAllFromDB) {
mapState.put(record.getId(), record);
}
}
So when I start my application from a snapshot, I assume the MapState will already contain data, and I do not want to read from the DB again. How can I check whether the MapState is empty or contains data?
If I understand correctly, you want to load the database data only once; usually you do this in the open() method. Alternatively, you can keep the database data in another MapState and call MapState::isEmpty() on it.
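If isEmpty() is not available in the Flink version in use, an emptiness check can be built on iteration instead. The helper below is a minimal sketch in plain Java: StateEmptiness is a hypothetical name, and a HashMap stands in for the state so the example runs without Flink. In a Flink function the argument would be mapState.entries(); the null check is an assumption that some state backends may return null when nothing has been stored. Keep in mind that keyed state is scoped to the current key, so such a check applies per key.

```java
import java.util.HashMap;
import java.util.Map;

class StateEmptiness {
    // True when there are no entries. A null Iterable is treated as empty,
    // on the assumption that a backend may return null for missing state.
    static <K, V> boolean isEmpty(Iterable<Map.Entry<K, V>> entries) {
        return entries == null || !entries.iterator().hasNext();
    }

    public static void main(String[] args) {
        Map<String, String> records = new HashMap<>();
        System.out.println(isEmpty(records.entrySet())); // true: nothing loaded yet
        records.put("id-1", "record");
        System.out.println(isEmpty(records.entrySet())); // false: skip the DB load
    }
}
```

With a helper like this, readForFirstTime() could return early whenever the state already holds entries, for example after a restore from a snapshot.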

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to show this with generating data externally and writing to Kafka, or listening to a public source, however I am trying to solve it with minimal dependence (like starting with GenerateFlowFile in Nifi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public class IncrementMapFunction implements MapFunction<Long, Long> {
@Override
public Long map(Long record) throws Exception {
return record + 1 ;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
@Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
@Override
public void cancel() {
cancelled = true;
}
}
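The pluggability idea itself does not depend on Flink. As a rough illustration, the sketch below mirrors the shape of TestableStreamingJob in plain Java: the source is any Iterable, the sink is any Consumer, and a test-style run feeds a hardwired list and collects the results. PluggableJob and runWithFixedInput are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, Flink-free analogue of TestableStreamingJob: source and sink are injected.
class PluggableJob {
    private final Iterable<Long> source;
    private final Consumer<Long> sink;

    PluggableJob(Iterable<Long> source, Consumer<Long> sink) {
        this.source = source;
        this.sink = sink;
    }

    // Same business logic as IncrementMapFunction: add one to each record.
    static long increment(long record) {
        return record + 1;
    }

    // Pull every record from the source, transform it, and hand it to the sink.
    void execute() {
        for (Long record : source) {
            sink.accept(increment(record));
        }
    }

    // Test-style run: hardwired input, results collected in a list.
    static List<Long> runWithFixedInput(List<Long> input) {
        List<Long> collected = new ArrayList<>();
        new PluggableJob(input, collected::add).execute();
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(runWithFixedInput(Arrays.asList(1L, 2L, 3L))); // prints [2, 3, 4]
    }
}
```

In development the same constructor could receive a random source and a printing sink, which is the role RandomLongSource and PrintSinkFunction play in the Flink version.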

Flink StreamingFileSink not writing data to AWS S3

I have a collection that represents a data stream, and I am testing StreamingFileSink by writing that stream to S3. The program runs successfully, but there is no data in the given S3 path.
public class S3Sink {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(100);
List<String> input = new ArrayList<>();
input.add("test");
DataStream<String> inputStream = see.fromCollection(input);
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("<S3 Path>"),
new SimpleStringEncoder<>("UTF-8"))
.withRollingPolicy(rollingPolicy)
.build();
inputStream.addSink(s3Sink);
see.execute();
}
}
Checkpointing is enabled as well. Any thoughts on why the sink is not working as expected?
UPDATE:
Based on David's answer, I created a custom source that generates random strings continuously, and I expect checkpointing to trigger after the configured interval and write the data to S3.
public class S3SinkCustom {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(1000);
DataStream<String> inputStream = see.addSource(new CustomSource());
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("s3://mybucket/data/"),
new SimpleStringEncoder<>("UTF-8"))
.build();
//inputStream.print();
inputStream.addSink(s3Sink);
see.execute();
}
static class CustomSource extends RichSourceFunction<String> {
private volatile boolean running = false;
final String[] strings = {"ABC", "XYZ", "DEF"};
@Override
public void open(Configuration parameters){
running = true;
}
@Override
public void run(SourceContext<String> sourceContext) throws Exception {
while (running) {
Random random = new Random();
int index = random.nextInt(strings.length);
sourceContext.collect(strings[index]);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
running = false;
}
}
}
Still, there is no data in S3, and the Flink process does not even validate whether the given S3 bucket is valid; the process runs without reporting any issues.
Update:
Below are the details of the custom rolling policy:
public class CustomRollingPolicy implements RollingPolicy<Object, String> {
@Override
public boolean shouldRollOnCheckpoint(PartFileInfo partFileInfo) throws IOException {
return partFileInfo.getSize() > 1;
}
@Override
public boolean shouldRollOnEvent(PartFileInfo partFileInfo, Object o) throws IOException {
return true;
}
@Override
public boolean shouldRollOnProcessingTime(PartFileInfo partFileInfo, long l) throws IOException {
return true;
}
}
I believe the issue is that the job you've written isn't going to run long enough to actually checkpoint, so the output isn't going to be finalized.
Another potential issue is that the StreamingFileSink only works with the Hadoop-based S3 filesystem (and not the one from Presto).
The above issue was resolved after setting up flink-conf.yaml with the required s3a properties, such as fs.s3a.access.key and fs.s3a.secret.key.
We also need to let Flink know where the configuration is located:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));
With these changes, I was able to run the S3 sink locally, and the messages were persisted to S3 without any issues.

CassandraPojoSink no error, however data is not getting written into the cassandra

I am trying to use Flink's Cassandra sink connector via the CassandraPojoSink class. I am not getting any error or exception, but no records are committed to the Cassandra table.
I am using the following code.
========= Sink connector code snapshot ==================
DataStream<Event> stream = eventStream.flatMap(new EventTransformation());
try {
stream.addSink(new CassandraPojoSink<>(Event.class, new ClusterBuilder() {
private static final long serialVersionUID = -2485105213096858846L;
@Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.addContactPoint("localhost").withPort(9042).build();
}
}));
} catch (Exception e) {
e.printStackTrace();
}
====== POJO CLASS ================
@Table(keyspace= "cloud", name = "event")
public class Event implements Serializable {
private static final long serialVersionUID = 3284839826384795926L;
@Column(name = "name")
private String name;
@Column(name = "msg")
private String msg;
public Event(){
}
//......
}
There are many reasons why a Flink job might fail to produce any output. Some common reasons include:
the app doesn't call env.execute()
the app is set to use event time, but there's no watermark generator
the watermarking logic is confused somehow, and no watermarks are being generated (e.g., the app is generating watermarks based on the CPU clock rather than event timestamps, causing every event to be late)
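To make the last pitfall concrete, here is a small Flink-free model of a bounded-out-of-orderness watermark. BoundedLatenessTracker and its method names are hypothetical and exist only for this sketch. It treats an event as late when its timestamp is at or behind the current watermark, and shows that a watermark taken from the wall clock makes events with older timestamps look late.

```java
// Hypothetical, Flink-free model of a bounded-out-of-orderness watermark.
class BoundedLatenessTracker {
    private final long maxOutOfOrdernessMs;
    private long currentMaxTimestamp = Long.MIN_VALUE;

    BoundedLatenessTracker(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Record an event timestamp and remember the largest one seen so far.
    void observe(long eventTimestampMs) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestampMs);
    }

    // The watermark trails the largest event timestamp by the allowed out-of-orderness.
    long currentWatermark() {
        if (currentMaxTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no events seen yet
        }
        return currentMaxTimestamp - maxOutOfOrdernessMs;
    }

    // An event is late when its timestamp is at or behind the watermark.
    boolean isLate(long eventTimestampMs) {
        return eventTimestampMs <= currentWatermark();
    }

    public static void main(String[] args) {
        BoundedLatenessTracker tracker = new BoundedLatenessTracker(1_000L);
        tracker.observe(10_000L);
        tracker.observe(10_500L);
        System.out.println(tracker.currentWatermark()); // 9500
        System.out.println(tracker.isLate(9_800L));     // false: within the allowed bound

        // Pitfall: a watermark taken from the wall clock is far ahead of these timestamps.
        long wallClockWatermark = System.currentTimeMillis();
        System.out.println(9_800L <= wallClockWatermark); // true: the event looks late
    }
}
```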
After changing the POJO to a Tuple and adding timestamp and watermark assignment, the code works perfectly.
I can see that my data is written into the Cassandra database.
DataStream<Tuple3<String, String, Long>> events = event_stream.flatMap(new EventTransformation()).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Tuple3<String, String, Long>>() {
private static final long serialVersionUID = 1L;
private final long maxOutOfOrderness = 1_000L; // 1 second
private long currentMaxTimestamp = 0;
@Override
public long extractTimestamp(Tuple3<String, String, Long> arg0,
long arg1) {
long timestamp = arg0.f2; // third field of the Tuple3 holds the timestamp
currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
return timestamp;
}
@Override
public Watermark getCurrentWatermark() {
return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
}
});
event_stream.addSink(new CassandraTupleSink<Tuple3<String, String, Long>>("INSERT INTO cloud.condition (name, msg, time) VALUES (?,?,?);", new ClusterBuilder() {
private static final long serialVersionUID = 1L;
@Override
protected Cluster buildCluster(Builder builder) {
return builder.addContactPoint("localhost").withPort(9042).build();
}
}));
env.setParallelism(2);
env.execute();

Apache Flink - use values from a data stream to dynamically create a streaming data source

I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the run-time error message "The implementation of the MapFunction is not serializable any more. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to troubleshoot and will post any updates.
public class RetrieveStockPrices {
@SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "stocks");
DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));
DataStream<String> stockPrice =
streamOfStockSymbols
//get unique keys
.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String trend) throws Exception {
return trend;
}
})
//collect events over a window
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
//return the last event from the window...all elements are the same "Symbol"
.apply(new WindowFunction<String, String, String, TimeWindow>() {
@Override
public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
out.collect(input.iterator().next().toString());
}
})
.map(new StockSymbolToPriceMapFunction());
streamExecEnv.execute("Retrieve Stock Prices");
}
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
@Override
public String map(String stockSymbol) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);
DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
return "100000";
}
private static class CustomKeySelector implements KeySelector<String, String> {
@Override
public String getKey(String arg0) throws Exception {
return arg0.trim();
}
}
}
public class LookupStockPrice extends RichSourceFunction<String> {
public String stockSymbol = null;
public boolean isRunning = true;
public LookupStockPrice(String inSymbol) {
stockSymbol = inSymbol;
}
@Override
public void open(Configuration parameters) throws Exception {
isRunning = true;
}
@Override
public void cancel() {
isRunning = false;
}
@Override
public void run(SourceFunction.SourceContext<String> ctx)
throws Exception {
String stockPrice = "0";
while (isRunning) {
//TODO: query Google Finance API
stockPrice = Integer.toString((new Random()).nextInt(100)+1);
ctx.collect(stockPrice);
Thread.sleep(10000);
}
}
}
StreamExecutionEnvironment is not intended to be used inside the operators of a streaming application. Not intended means that this is neither tested nor encouraged. It might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program defines, for each incoming record, a completely new and independent streaming application. However, since you do not call streamExecEnv.execute(), these programs are never started, and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on that local cluster. The local Flink instance would take up a lot of heap space, and after a few clusters had been started, the worker would probably die with an OutOfMemoryError, which is not what you want.
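The answer does not prescribe a replacement, but one common alternative is to perform the lookup directly inside the function, so that each input symbol yields one output record and no nested job is built. The sketch below shows that shape in plain Java without Flink. PriceLookup and SymbolToPrice are hypothetical names, and the stub price is for illustration only; in a Flink job the body of apply would sit in the map method of a RichMapFunction.

```java
import java.util.function.Function;

// Hypothetical lookup client; a real one would call a pricing service.
interface PriceLookup {
    String priceFor(String symbol);
}

// Flink-free analogue of a map function: one input symbol, one output record,
// with the lookup done inline rather than by building a nested streaming job.
class SymbolToPrice implements Function<String, String> {
    private final PriceLookup lookup;

    SymbolToPrice(PriceLookup lookup) {
        this.lookup = lookup;
    }

    @Override
    public String apply(String stockSymbol) {
        String symbol = stockSymbol.trim();
        return symbol + "," + lookup.priceFor(symbol);
    }

    public static void main(String[] args) {
        // Stub lookup that always answers "100", for illustration only.
        SymbolToPrice fn = new SymbolToPrice(symbol -> "100");
        System.out.println(fn.apply(" CSCO ")); // prints CSCO,100
    }
}
```

The resulting stream of prices can then be written to the Kafka topic by a single sink attached in the main job, where execute() is called once.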
