Memory is not coming down after data processing in Apache Flink

I am using a BroadcastProcessFunction to perform simple pattern matching, broadcasting around 60 patterns. Once processing completes, the memory does not come down. I have set env.java.opts = "-XX:+UseG1GC" in my Flink configuration file to tune garbage collection, but that does not help either. The CPU percentage does come down after the data processing completes. I am checkpointing every 2 minutes and my state backend is the filesystem backend. Below are screenshots of memory and CPU usage.

I don't see anything surprising or problematic in the graphs you have shared. After ingesting the patterns, each instance of your BroadcastProcessFunction will be holding onto a copy of all of the patterns -- so that will consume some memory.
If I understand correctly, it sounds like the situation is that as data is processed for matching against those patterns, the memory continues to increase until the pods crash with out-of-memory errors. Various factors might explain this:
If your patterns involve matching a sequence of events over time, then your pattern matching engine has to keep state for each partial match. If there's no timeout clause to ensure that partial matches are eventually cleaned up, this could lead to a combinatorial explosion.
If you are doing key-partitioned processing and your keyspace is unbounded, you may be holding onto state for stale keys (see the state TTL sketch after this list).
The filesystem state backend keeps its working state on the JVM heap and only checkpoints to the filesystem, so it has considerable memory overhead. You may have underestimated how much memory it needs.
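For the stale-key case in particular, if the keyed side of your job keeps per-key match state (for example in a KeyedBroadcastProcessFunction), one option worth considering is state TTL, so Flink drops entries that have not been written for a while. A minimal sketch, with the descriptor name, value type, and TTL chosen arbitrarily (TTL applies to keyed state; the broadcast state holding your 60 patterns is small and not the issue):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// In open() of the process function, before acquiring the state handle:
ValueStateDescriptor<String> descriptor =
        new ValueStateDescriptor<>("match-progress", String.class);  // use your real state type

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(1))                                   // arbitrary TTL
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

descriptor.enableTimeToLive(ttlConfig);
// matchState = getRuntimeContext().getState(descriptor);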

Related

Flink checkpoints keep failing

We are trying to set up a stateful Flink job using the RocksDB backend.
We are using session windows with a 30-minute gap. We use an AggregateFunction, so we are not using any Flink state variables directly.
Based on sampling, we have fewer than 20k events/s and 20-30 new sessions/s. Our sessions basically gather all of the events, so the size of the session accumulator grows over time.
We are using 10 GB of memory in total, with Flink 1.9 and 128 containers.
These are the settings:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/myjob/path
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.4
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
containerized.heap-cutoff-ratio: 0.45
taskmanager.network.memory.fraction: 0.5
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 2560mb
From our monitoring at a given point in time:
the RocksDB memtable size is less than 10 MB,
our heap usage is less than 1 GB, but our direct memory usage (network buffers) is at 2.5 GB. The buffer pool / buffer usage metrics are all at 1 (full).
Our checkpoints keep failing.
I wonder if it's normal that the network buffers could use up this much memory?
I'd really appreciate it if you can give some suggestions. :)
Thank you!
For what it's worth, session windows do use Flink state internally. (So do most sources and sinks.) Depending on how you are gathering the session events into the session accumulator, this could be a performance problem. If you need to gather all of the events together, why are you doing this with an AggregateFunction, rather than having Flink do this for you?
For the best windowing performance, you want to use a ReduceFunction or an AggregateFunction that incrementally reduces/aggregates the window, keeping only a small bit of state that will ultimately be the result of the window. If, on the other hand, you use only a ProcessWindowFunction without pre-aggregation, then Flink will internally use an appending list state object that when used with RocksDB is very efficient -- it only has to serialize each event to append it to the end of the list. When the window is ultimately triggered, the list is delivered to you as an Iterable that is deserialized in chunks. On the other hand, if you roll your own solution with an AggregateFunction, you may have RocksDB deserializing and reserializing the accumulator on every access/update. This can become very expensive, and may explain why the checkpoints are failing.
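To make that concrete, here is a rough sketch of the ProcessWindowFunction-without-pre-aggregation variant, where Flink gathers the session for you and RocksDB only appends each serialized event to list state. The Event and Result types, the key selector, and summarize() are placeholders for whatever your job actually uses:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

DataStream<Event> events = ...;  // your input stream (Event is a placeholder type)

DataStream<Result> sessions = events
    .keyBy(e -> e.getSessionKey())                        // placeholder key selector
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .process(new ProcessWindowFunction<Event, Result, String, TimeWindow>() {
        @Override
        public void process(String key,
                            Context ctx,
                            Iterable<Event> sessionEvents,
                            Collector<Result> out) {
            // With RocksDB, sessionEvents is backed by appending list state and is
            // deserialized lazily here, only when the window fires.
            out.collect(summarize(key, sessionEvents));   // summarize() is a placeholder
        }
    });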
Another interesting fact you've shared is that the buffer pool / buffer usage metrics show that they are fully utilized. This is an indication of significant backpressure, which in turn would explain why the checkpoints are failing. Checkpointing relies on the checkpoint barriers being able to traverse the entire execution graph, checkpointing each operator as they go, and completing a full sweep of the job before timing out. With backpressure, this can fail.
The most common cause of backpressure is under-provisioning -- or in other words, overwhelming the cluster. The network buffer pools become fully utilized because the operators can't keep up. The answer is not to increase buffering, but to remove/fix the bottleneck.

Solr streaming vs normal search

What are the key differences between streaming and normal search in terms of internal implementation? A normal search would also work in a distributed manner, right? How does streaming improve performance? The documentation is not helping out.
Distributed search issues requests, calculates results, delivers results and handles merging. Each step is processed fully before continuing to the next step. This works well enough when the amount of data is small. For larger requests, such as delivering millions of documents, this requires huge memory buffers. It also means that the caller has to wait until the very last step (delivery of the result to the caller) before the result can be processed.
With streaming, all this is ongoing: Calculation, delivery and merging happens at the same time, with a fixed upper memory overhead. You can ask for 10K results or you can ask for 10 billion, the only difference is how long it takes. As every part of the process is active at the same time, including delivery to the caller, this also means that the first result data will be delivered to the caller very quickly.
Internally, streaming is basically paging through the search result. Each page (10K documents, if I remember correctly) is passed along to the stream as soon as it has been calculated. Ignoring optimizations, the same behaviour could be emulated from the outside with deep paging and a custom merger.
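For illustration, the from-the-outside emulation might look roughly like the following SolrJ sketch using cursorMark deep paging. Only a single collection is shown, the custom merging across shards is omitted, and the URL, collection name, page size, and sort field are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DeepPagingExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(10_000);                           // one "page" per request
            query.setSort(SolrQuery.SortClause.asc("id"));   // cursor paging needs a stable sort

            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse response = client.query(query);
                for (SolrDocument doc : response.getResults()) {
                    // process each document as it arrives instead of buffering everything
                }
                String next = response.getNextCursorMark();
                done = cursorMark.equals(next);              // no progress means we are finished
                cursorMark = next;
            }
        }
    }
}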

ndb data contention getting worse and worse

I have a bit of a strange problem. I have a module running on GAE that puts a whole lot of little tasks on the default task queue. The tasks access the same ndb models. Each task reads a bunch of data from a few different tables and then calls put.
The first few tasks work fine, but as time goes on I start getting these on the final put:
suspended generator _put_tasklet(context.py:358) raised TransactionFailedError(too much contention on these datastore entities. please try again.)
So I wrapped the put in a try and added a randomised timeout so that it retries a couple of times. This mitigated the problem a little; it just happens later on.
Here is some pseudocode for my task:
def my_task(request):
    stuff = get_ndb_instances()    # reads a few entities from different tables
    better_stuff = process(stuff)  # pretty much just a summation
    try_put(better_stuff)
    return {'status': 'Groovy'}

def try_put(oInstance, iCountdown=10):
    if iCountdown < 1:
        return oInstance.put()     # last attempt: let any error propagate
    try:
        return oInstance.put()
    except:                        # too-much-contention error from the put
        import time
        import random
        logger.info("sleeping")
        time.sleep(random.random() * 20)  # randomised sleep before retrying
        return try_put(oInstance, iCountdown - 1)
Without using try_put the queue gets about 30% of the way through until it stops working. With the try_put it gets further, like 60%.
Could it be that a task is holding onto ndb connections after it has completed somehow? I'm not making explicit use of transactions.
EDIT:
There seems to be some confusion about what I'm asking. The question is: why does ndb contention get worse as time goes on? I have a whole lot of tasks running simultaneously, and they access ndb in a way that can cause contention. If contention is detected, a randomly timed retry happens, and that eliminates the contention perfectly well, for a little while. Tasks keep running and completing, and the more of them that successfully return, the more contention happens, even though the processes using the contended-upon data should be finished. Is there something going on that's holding onto datastore handles that shouldn't be? What's going on?
EDIT2:
Here is a little bit about the key structures in play:
My ndb models sit in a hierarchy that looks something like this (the direction of the arrows indicates parent-child relationships, i.e. a Type has a bunch of child Instances, and so on):
Type->Instance->Position
The ids of the Positions are limited to a few different names; there are many thousands of Instances and not many Types.
I calculate a bunch of Positions and then do a try_put_multi (similar to try_put in an obvious way) and get contention. I'm going to run the code again pretty soon and get a full traceback to include here.
Contention will get worse over time if you continually exceed the limit of 1 write/transaction per entity group per second. The answer lies in how Megastore/Paxos works and how Cloud Datastore handles contention in the backend.
When 2 writes are attempted at the same time on different nodes in Megastore, one transaction will win and the other will fail. Cloud Datastore detects this contention and will retry the failed transaction several times. Usually this results in the transaction succeeding without any errors being raised to the client.
If sustained writes above the recommended limit are being attempted, the chance that a transaction needs to be retried multiple times increases. The number of transactions in an internal retry state also increases. Eventually, transactions will start reaching our internal retry limit and will return a contention error to the client.
A randomized sleep is not the right way to handle this error response. You should instead look into exponential back-off with jitter (example).
More fundamentally, the core of your problem is a high write rate into a single entity group. You should look into whether the explicit parenting is required (and remove it if not), or whether you should shard the entity group in some manner that makes sense for your queries and consistency requirements.
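Exponential back-off with jitter is language-agnostic; as a rough illustration (not GAE-specific, shown in Java, with the base delay, cap, and attempt count picked arbitrarily), the retry loop looks something like this:

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class Backoff {
    // Retry op with "full jitter": sleep a random amount between 0 and an exponentially
    // growing cap, rather than a flat randomised sleep.
    public static <T> T retry(Callable<T> op, int maxAttempts) throws Exception {
        long baseMillis = 50;       // arbitrary base delay
        long capMillis = 10_000;    // arbitrary upper bound on a single sleep
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e;        // out of attempts: surface the contention error
                }
                long ceiling = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 20)));
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            }
        }
    }
}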

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a Yarn Cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array-form. Right now, I have implemented it something like
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray(new Pojo[0]);
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff();
This solution seems rather unnatural. Is there a straightforward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen in parallel is not my biggest problem right now, but would you still prefer to write what I called input above into a file and build an array from that?
I have also seen the options to configure managed memory in flink. In principle, it might be possible to tune this in a way so that enough heap is left for the expensive operation. On the other hand, I am afraid that the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the "collect->array->costlyOperation->array->fromCollection" step with a key-less reduce operation using a surrogate key that has the same value for all tuples, so that you get only a single partition. That would be the Flink-like way to do it.
In your costly operation itself, implemented as a GroupReduceFunction, you get an iterator over the data. If you do not need to access all of the data "at once", you also save heap space, because you do not need to keep all of the data in memory within the reduce (but this of course depends on what your costly operation computes).
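As a rough sketch of that surrogate-key approach (reusing the Pojo and costlyOperation names from the question; the array materialization inside the reduce is only needed if the API really requires an array):

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

DataSet<Pojo> input = ...;

DataSet<Pojo> result = input
    // surrogate key: the same constant for every record, so everything
    // ends up in a single group on a single node
    .map(new MapFunction<Pojo, Tuple2<Integer, Pojo>>() {
        @Override
        public Tuple2<Integer, Pojo> map(Pojo value) {
            return Tuple2.of(1, value);
        }
    })
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple2<Integer, Pojo>, Pojo>() {
        @Override
        public void reduce(Iterable<Tuple2<Integer, Pojo>> values, Collector<Pojo> out) {
            // If the costly API truly needs an array, materialize it here; if it can
            // consume an Iterable, stream over 'values' instead and save the heap space.
            java.util.List<Pojo> buffer = new java.util.ArrayList<>();
            for (Tuple2<Integer, Pojo> v : values) {
                buffer.add(v.f1);
            }
            for (Pojo r : costlyOperation(buffer.toArray(new Pojo[0]))) {
                out.collect(r);
            }
        }
    });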
As an alternative, you could also call reduce() without a previous groupBy(). However, you do not get an iterator or an output collector, and can only compute partial aggregates. (See "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations)
Using Flink-style operations has the advantage that the data is kept in the cluster. If you collect() the result, it is transferred to the client, the costly operation is executed in the client, and the result is transferred back to the cluster. Furthermore, if the input is large, Flink will automatically spill the intermediate result to disk for you.

Handling multiple calls to BeginExecuteNonQuery in SQL Server 2008

I have an application that is receiving a high volume of data that I want to store in a database. My current strategy is to fire off an asynchronous call (BeginExecuteNonQuery) with each record when it's ready. I'm using the asynchronous call to ensure that the rest of the application runs smoothly.
The problem I have is that as the volume of data increases, eventually I get to the point where I'm trying to fire a command down the connection while it's still in use. I can see two possible options:
Buffer the pending data myself until the existing command is finished.
Open multiple connections as needed.
I'm not sure which of these options is best, or if in fact there is a better way. Option 1 will probably lead to my buffer getting bigger and bigger, while option 2 may be very bad form - I just don't know.
Any help would be appreciated.
Depending on your locking strategy, it may be worth using several connections, but certainly not a number "without upper bound". A good strategy/pattern to use here is a thread pool, with each of N dedicated threads holding a connection and picking up write requests as they come in and as the thread finishes its previous one. The number of threads in the pool that gives the best performance is best determined empirically, by benchmarking various possibilities in a realistic experimental/prototype setting.
If the "buffer" queue (in which your main thread queues write requests and the dedicated threads in the pool pick them up) grows beyond a certain threshold, it means you're getting data faster than you can possibly write it out, so, unless you can get more resources, you'll simply have to drop some of the incoming data, perhaps with a random-sampling strategy to avoid biasing future statistical analysis. Just count how much you're writing and how much you're having to drop due to the resource shortage in each period of time (say, every minute or so), so you can use "stratified sampling" techniques in future data-mining explorations.
Thanks Alex - so you'd suggest a hybrid method then, assuming that I'll still need to buffer updates if all connections are in use?
(I'm the original poster, I've just managed to get two accounts without realizing)
