Why is the Flink task heap used periodically, even though no data is coming?

I noticed something strange about my Flink job: the task heap is being used periodically, even though the job hasn't consumed any data since it started.
Here is my code, a stateful map function with three kinds of state:
// Wrapper class and imports added for completeness; the original snippet
// showed only the state declarations, open(), and map().
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state._
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

class StatefulMapper extends RichMapFunction[TestModel, TestModel] {
  var listState: ListState[String] = _
  var valueState: ValueState[TestModel] = _
  var mapState: MapState[String, TestModel] = _

  override def open(parameters: Configuration): Unit = {
    val descriptor = new ListStateDescriptor[String](
      "buffered-elements",
      createTypeInformation[String]
    )
    listState = getRuntimeContext.getListState(descriptor)

    val adStateDescriptor = new ValueStateDescriptor("LastEventState", createTypeInformation[TestModel])
    valueState = getRuntimeContext.getState(adStateDescriptor)

    val mapStateDescriptor = new MapStateDescriptor("SubsegmentState",
      BasicTypeInfo.STRING_TYPE_INFO, createTypeInformation[TestModel])
    mapState = getRuntimeContext.getMapState(mapStateDescriptor)
  }

  override def map(in: TestModel): TestModel = {
    // ...... (body elided in the original question)
  }
}
TaskManager memory is 6 GB, and the task heap size is 2.3 GB.
Here is the graph of heap usage; the blue line is my task.


Flink TaskManager hangs

Here is the program:
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
ParameterTool parameters = ParameterTool.fromArgs(args);
String ftpUri; // value elided in the original (presumably taken from parameters)
env.readTextFile(ftpUri, "UTF-8")
    .map(mapFunction)
    .keyBy(tuple2 -> tuple2.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
    .reduce((tuple2, t1) -> {
        Collection<OpisRecord> newCol = new ArrayList<>();
        newCol.addAll(tuple2.f1);
        newCol.addAll(t1.f1);
        return new Tuple2<>(tuple2.f0, newCol);
    })
    .addSink(new SinktoDistributedCache());
env.execute();
It works fine for record counts from 10k to 40k, but hangs for anything above 40k.
I have tried increasing the number of TaskManagers and the parallelism, but with no gain.
Any clues?

Is it possible to make different keys have independent watermarks?

I am using Flink 1.12 and I have a keyed stream. In my code, it looks like both A and B share the same watermark, so B is determined to be late because A's arrival has advanced the watermark to 2020-08-30 10:50:11.
The output is A(2020-08-30 10:50:08, 2020-08-30 10:50:16):2020-08-30 10:50:15, and there is no output for B.
I would like to ask whether it is possible to make different keys have independent watermarks, so that A's watermark and B's watermark advance independently.
The application code is:
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object DemoDiscardLateEvent4_KeyStream {

  def to_milli(str: String) =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(str).getTime

  def to_char(milli: Long) = {
    val date = if (milli <= 0) new Date(0) else new Date(milli)
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date)
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val data = Seq(
      ("A", "2020-08-30 10:50:15"),
      ("B", "2020-08-30 10:50:07")
    )

    env.fromCollection(data).setParallelism(1)
      .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, String)]() {
        var maxSeen = Long.MinValue

        override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
          val eventTime = to_milli(lastElement._2)
          if (eventTime > maxSeen) {
            maxSeen = eventTime
          }
          // Allow 4 seconds late
          new Watermark(maxSeen - 4000)
        }

        override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = to_milli(element._2)
      })
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.of(8, TimeUnit.SECONDS)))
      .apply(new WindowFunction[(String, String), String, String, TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(String, String)], out: Collector[String]): Unit = {
          val start = to_char(window.getStart)
          val end = to_char(window.getEnd)
          val sb = new StringBuilder
          // the start and end of the window
          sb.append(s"$key($start, $end):")
          // the content of the window
          input.foreach { e =>
            sb.append(e._2 + ",")
          }
          out.collect(sb.toString().substring(0, sb.length - 1))
        }
      })
      .print()

    env.execute()
  }
}
While it would sometimes be helpful if Flink offered per-key watermarking, it does not.
Each parallel instance of your WatermarkStrategy (or in this case, of your AssignerWithPunctuatedWatermarks) is generating watermarks independently, based on the timestamps of the events it observes (regardless of their keys).
One way to work around the lack of this feature is to not use watermarks at all. For example, if you would otherwise use per-key watermarks to trigger keyed event-time windows, you can instead implement your own windows with a KeyedProcessFunction: rather than using watermarks to trigger event-time timers, keep track of the largest timestamp seen so far for each key, and whenever you update that value, check whether you now want to close one or more windows for that key. A sketch of this idea follows below.
See one of the Flink training lessons for an example of how to implement keyed tumbling windows with a KeyedProcessFunction. That example depends on watermarks, but it should help you get started.
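Here is a minimal sketch of that workaround, assuming events arrive as a key plus an event-time timestamp; the Event case class, the state names, and the per-window count aggregate are illustrative assumptions, not the training example itself:

import org.apache.flink.api.common.state.{MapStateDescriptor, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Illustrative event type: a key plus an event-time timestamp in milliseconds.
case class Event(key: String, ts: Long)

class PerKeyTumblingWindows(windowSizeMs: Long, allowedLatenessMs: Long)
    extends KeyedProcessFunction[String, Event, String] {

  // Largest event timestamp seen so far for this key: a per-key "watermark".
  private lazy val maxTs = getRuntimeContext.getState(
    new ValueStateDescriptor[java.lang.Long]("maxTs", classOf[java.lang.Long]))

  // Window start -> element count (replace with whatever aggregate you need).
  private lazy val counts = getRuntimeContext.getMapState(
    new MapStateDescriptor[java.lang.Long, java.lang.Long](
      "windowCounts", classOf[java.lang.Long], classOf[java.lang.Long]))

  override def processElement(
      e: Event,
      ctx: KeyedProcessFunction[String, Event, String]#Context,
      out: Collector[String]): Unit = {
    // Assign the element to its tumbling window and update the aggregate.
    val windowStart = e.ts - (e.ts % windowSizeMs)
    val previous = Option(counts.get(windowStart)).map(_.longValue).getOrElse(0L)
    counts.put(windowStart, previous + 1)

    // Advance this key's private "watermark".
    val newMax = math.max(Option(maxTs.value).map(_.longValue).getOrElse(Long.MinValue), e.ts)
    maxTs.update(newMax)

    // Close every window whose end (plus allowed lateness) this key has passed.
    val it = counts.iterator()
    while (it.hasNext) {
      val entry = it.next()
      val start = entry.getKey.longValue
      if (start + windowSizeMs + allowedLatenessMs <= newMax) {
        out.collect(s"${ctx.getCurrentKey}[$start, ${start + windowSizeMs}): ${entry.getValue} events")
        it.remove()
      }
    }
  }
}

With stream.keyBy(_.key).process(new PerKeyTumblingWindows(8000, 4000)), each key closes its windows based only on its own largest timestamp, so in the example above B would no longer be dropped because of A. The trade-off is that an idle key never closes its open windows, since nothing advances its per-key maximum.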

Flink windows do not fire after connecting to broadcast stream

I am trying to use the broadcast state pattern to extend the functionality of my application.
Here is some code. The main program:
// .... ///
val gatewayBroadcastStateDescriptor =
  new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])

// Broadcast source
val broadcastSource = env
  .addSource(new FlinkKinesisConsumer[String](s"BROADCAST", new SimpleStringSchema, consumerConfig))

val broadcastSourceGatewayEvents = broadcastSource
  .filter(_.contains("someText"))
  .map(json => read[BCASTDATACLASS](json))

val broadcastGatewayEventsConfigurations =
  broadcastSourceGatewayEvents.broadcast(gatewayBroadcastStateDescriptor)

// packet source
val packetSource = env
  .addSource(
    new FlinkKinesisConsumer[String](s"PACKETS", new SimpleStringSchema, consumerConfig))

val packets = packetSource.disableChaining()
  .map(json => read[MAINDATACLASS](json))
  .assignTimestampsAndWatermarks(WatermarkStrategy
    .forBoundedOutOfOrderness[MAINDATACLASS](Duration.ofSeconds(2))
    .withTimestampAssigner(new PacketWatermarkGenerator))
  .timeWindowAll(Time.seconds(2))
  .process(new OrderPacketWindowFunction)
  .disableChaining()

// connect MainDataSource with BroadcastDataSource
val gwEnrichedPackets = packets
  .keyBy(_.gatewayId)
  .connect(broadcastGatewayEventsConfigurations)
  .process(new EnrichingPackets)
My process function (which in this example does nothing, just forwards the data on):
//....//
class EnrichingPackets()
    extends KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]
    with LazyLogging {

  private lazy val gatewayEventsStateDescriptor =
    new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])

  override def processBroadcastElement( // stream element, context, collector to emit resulting elements
      broadcastInput: BCASTDATACLASS,
      ctx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#Context,
      out: Collector[MAINDATACLASS]): Unit = {
    val gatewayEvents = ctx.getBroadcastState(gatewayEventsStateDescriptor)
    println("OK")
  }

  override def processElement(
      packetInput: MAINDATACLASS,
      readOnlyCtx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#ReadOnlyContext,
      out: Collector[MAINDATACLASS]): Unit = {
    // get read-only broadcast state
    val gatewayEvents = readOnlyCtx.getBroadcastState(gatewayEventsStateDescriptor)
    out.collect(packetInput)
  }
}
After connecting the data and configuration streams, I want to open a window and do some processing.
But when I open a window on gwEnrichedPackets, nothing happens; in the Flink UI I can see ONLY messages going into the window. Even with session windows, stopping the data flow does not make the windows fire.
allowedLateness and sideOutputLateData do not help in investigating the problem.
An interesting point: if I open windows on packets instead, everything works properly.
// val sessionWindows = gwEnrichedPackets - does NOT work
// val sessionWindows = packets - works
val sessionWindows = gwEnrichedPackets
  .keyBy(_.tag.tagId)
  .timeWindow(Time.seconds(20))
  //.window(EventTimeSessionWindows.withGap(Time.seconds(120)))
  //.allowedLateness(Time.seconds(12000))
  //.sideOutputLateData(new OutputTag[MAINDATACLASS]("late-readings"))
  .process(new DetectTagGatewayDisconnections)

val lateStream = sessionWindows
  .getSideOutput(new OutputTag[MAINDATACLASS]("late-readings"))

lateStream.print()
sessionWindows.print()
What am I doing wrong?
The problem in this case is watermarking. You are assigning watermarks to only one of the streams, and Flink always picks the lowest watermark when more than one stream is on the input of a given operator.
So, in your case, Flink has to pick between the watermark generated by packets and the one generated by the broadcast stream, and the latter will always be Long.MinValue (because the control stream has no watermark generator). The combined watermark therefore stays at Long.MinValue, and the windows never make progress.
You can simply add a watermark assigner to the gwEnrichedPackets stream, and that should solve the issue; see the sketch below.
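A minimal sketch of that fix, assuming MAINDATACLASS exposes the packet's event time as a field; eventTimeMs is an assumed name for illustration:

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// Re-assign timestamps and watermarks after the broadcast connect, so the
// window operator downstream of gwEnrichedPackets sees a progressing watermark.
val sessionWindows = gwEnrichedPackets
  .assignTimestampsAndWatermarks(
    WatermarkStrategy
      .forBoundedOutOfOrderness[MAINDATACLASS](Duration.ofSeconds(2))
      .withTimestampAssigner(new SerializableTimestampAssigner[MAINDATACLASS] {
        override def extractTimestamp(element: MAINDATACLASS, recordTimestamp: Long): Long =
          element.eventTimeMs // assumed field carrying the packet's event time
      }))
  .keyBy(_.tag.tagId)
  .timeWindow(Time.seconds(20))
  .process(new DetectTagGatewayDisconnections)

Alternatively, the same strategy already used on packets (with PacketWatermarkGenerator) could be reused here; the important part is that watermarks are generated downstream of the connect, where only a single input is involved.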

How to return FileStreamResult with SqlDataReader

I have an ASP.NET Core project that downloads large files stored in SQL Server. It works fine for small files, but large files often time out because they are read entirely into memory before being downloaded.
So I am working to improve that.
Based on the SQL Client streaming support examples, I have updated the code to the following:
public async Task<FileStreamResult> DownloadFileAsync(int id)
{
    ApplicationUser user = await _userManager.GetUserAsync(HttpContext.User);
    var file = await this._attachmentRepository.GetFileAsync(id);
    using (SqlConnection connection = new SqlConnection(this.ConnectionString))
    {
        await connection.OpenAsync();
        using (SqlCommand command = new SqlCommand("SELECT [Content] FROM [Attachments] WHERE [AttachmentId] = @id", connection))
        {
            command.Parameters.AddWithValue("id", file.AttachmentId);
            SqlDataReader reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
            if (await reader.ReadAsync())
            {
                if (!(await reader.IsDBNullAsync(0)))
                {
                    Stream stream = reader.GetStream(0);
                    var result = new FileStreamResult(stream, file.ContentType)
                    {
                        FileDownloadName = file.FileName
                    };
                    return result;
                }
            }
        }
    }
    return null;
}
But when I test it, it throws this exception:
Cannot access a disposed object. Object name: 'SqlSequentialStream'
Is there a way to fix this exception?
Your using statements all trigger when you hit your return, disposing your connection and command, but the whole point of this is to let the stream copy happen in the background after your function is done.
For this pattern you're going to have to remove the using calls and let cleanup happen once the stream copy is done. FileStreamResult should at the very least call Dispose on the stream you give it, which should un-root the command and connection to be later finalized and closed.
This is the working code, which is dramatically faster than the version without streaming:
[HttpGet("download")]
public async Task<FileStreamResult> DownloadFileAsync(int id)
{
var connectionString = _configuration.GetConnectionString("DefaultConnection");
ApplicationUser user = await _userManager.GetUserAsync(HttpContext.User);
var fileInfo = await this._attachmentRepository.GetAttachmentInfoByIdAsync(id);
SqlConnection connection = new SqlConnection(connectionString);
await connection.OpenAsync();
SqlCommand command = new SqlCommand("SELECT [Content] FROM [Attachments] WHERE [AttachmentId]=#id", connection);
command.Parameters.AddWithValue("id", fileInfo.Id);
// The reader needs to be executed with the SequentialAccess behavior to enable network streaming
// Otherwise ReadAsync will buffer the entire BLOB into memory which can cause scalability issues or even OutOfMemoryExceptions
SqlDataReader reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
if (await reader.ReadAsync())
{
if (!(await reader.IsDBNullAsync(0)))
{
Stream stream = reader.GetStream(0);
var result = new FileStreamResult(stream, fileInfo.ContentType)
{
FileDownloadName = fileInfo.FileName
};
return result;
}
}
return null;
}

Watermarks in a RichParallelSourceFunction

I am implementing a SourceFunction which reads data from a database.
The job should be able to be resumed if stopped or crashed (i.e. via savepoints and checkpoints), with the data being processed exactly once.
What I have so far:
@SerialVersionUID(1L)
class JDBCSource(private val clientConfig: Serializable, private val waitTimeMs: Long)
    extends RichParallelSourceFunction[Event] with StoppableFunction with LazyLogging {

  @transient var client: PostGreClient = _
  @volatile var isRunning: Boolean = true

  // Secondary constructor using the default wait time (1000 ms)
  def this(clientConfig: Serializable) = this(clientConfig, 1000L)

  override def stop(): Unit = {
    this.isRunning = false
  }

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    client = new PostGreClient(clientConfig) // assumes PostGreClient takes its config here
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      val statement = client.getConnection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, timestamp FROM MYTABLE")
      while (resultSet.next()) {
        val name: String = resultSet.getString("name")
        val timestamp: Long = resultSet.getLong("timestamp")
        ctx.collectWithTimestamp(new Event(name, timestamp), timestamp)
      }
      Thread.sleep(waitTimeMs) // wait before polling the table again
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
How can I make sure to only fetch the rows of the database which haven't been processed yet?
I assumed the ctx variable would have some information about the current watermark, so that I could change my query to something like:
select name, timestamp from myTable where timestamp > ctx.getCurrentWaterMark
But it doesn't have any relevant methods for me. Any ideas on how to solve this problem would be appreciated.
You have to implement CheckpointedFunction so that you can manage checkpointing yourself. The documentation of the interface is pretty comprehensive, but if you need an example, take a look at one of the Flink examples.
In essence, your function must implement CheckpointedFunction#snapshotState to store the state you need using Flink's managed state, and then, when performing a restore, it will read that same state back in CheckpointedFunction#initializeState.
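As a minimal sketch of what that could look like for this source, assuming you track the timestamp of the last emitted row and resume from it after a restore (the lastTimestamp state and the query shape are illustrative assumptions, not part of the original code):

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CheckpointedJDBCSource extends RichParallelSourceFunction[Event] with CheckpointedFunction {

  @volatile private var isRunning = true
  // Largest timestamp emitted so far; snapshotted on checkpoints, restored on recovery.
  @volatile private var lastTimestamp: Long = Long.MinValue
  @transient private var checkpointedState: ListState[java.lang.Long] = _

  override def snapshotState(ctx: FunctionSnapshotContext): Unit = {
    checkpointedState.clear()
    checkpointedState.add(lastTimestamp)
  }

  override def initializeState(ctx: FunctionInitializationContext): Unit = {
    val descriptor = new ListStateDescriptor[java.lang.Long]("lastTimestamp", classOf[java.lang.Long])
    checkpointedState = ctx.getOperatorStateStore.getListState(descriptor)
    if (ctx.isRestored) {
      val it = checkpointedState.get().iterator()
      while (it.hasNext) lastTimestamp = math.max(lastTimestamp, it.next())
    }
  }

  override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
    while (isRunning) {
      // Query only rows newer than the last checkpointed timestamp, e.g.
      // "SELECT name, timestamp FROM MYTABLE WHERE timestamp > ?" bound to lastTimestamp.
      // Emit rows and advance lastTimestamp under the checkpoint lock, so a
      // snapshot never records a timestamp ahead of what was actually emitted:
      ctx.getCheckpointLock.synchronized {
        // ctx.collectWithTimestamp(event, ts)
        // lastTimestamp = math.max(lastTimestamp, ts)
      }
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

One subtlety: rows that share the restored timestamp straddle the resume boundary, so with a strict > filter they can be skipped and with >= they can be duplicated. In practice a strictly increasing column (such as an id) makes a safer resume cursor for exactly-once processing.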
