I have a system processing 3K messages/sec, with a lot of state and windows in the flow.
At the moment my managed memory usage is only 84 MB, although I reserved ~15 GB for it; the Flink web UI shows 84.4 MB / 14.8 GB.
RocksDB doesn't use it for its cache and write buffers. Can you help me understand why?
Below you can see my config.
taskmanager.memory.process.size: 51912m
taskmanager.memory.managed.fraction: 0.2
taskmanager.numberOfTaskSlots: 360
state.backend: rocksdb
state.backend.rocksdb.localdir: /home/asi/rockdbtmp/datadir
state.backend.rocksdb.thread.num: 4
state.backend.rocksdb.log.dir: /home/asi/flink-1.15.2/log/rocksdb
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
state.backend.rocksdb.memory.write-buffer-ratio: 0.5
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
Also, if I use the hashmap state backend my system works without any problem, but if I change it to RocksDB it locks up within a couple of seconds. I think this is also related to this buffer problem.
Thanks.
The problem was taskmanager.numberOfTaskSlots.
The total managed memory is 27 GB and the number of task slots was 360,
so each task slot gets only 27 GB / 360 (roughly 75 MB) of managed memory. I just changed it to 36.
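For reference, a minimal sketch of the change in flink-conf.yaml (the managed-memory total itself comes from the taskmanager.memory.* settings, so the per-slot numbers in the comments are just the arithmetic, not measured values):
# managed memory is split evenly across slots:
#   per-slot managed memory = total managed memory / numberOfTaskSlots
#   360 slots: 27 GB / 360 = 75 MB per slot
#    36 slots: 27 GB /  36 = 750 MB per slot
taskmanager.numberOfTaskSlots: 36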
With wrk, I run the following command:
wrk -t10 -c10 -d30s http://localhost:8080/myService --latency -H "Accept-Encoding: gzip"
As a result, I obtain Requests/sec: 15000 and no errors.
I am trying to reproduce the same kind of test with Gatling, so I have tried the following:
scn.inject(
rampUsersPerSec(1) to 15000 during (30 seconds)
)
But as a result, I obtain errors:
---- Errors --------------------------------------------------------------------
i.n.c.AbstractChannel$AnnotatedSocketException: Can't assign requested address: localhost/127.0.0.1:8080          573 (42,44%)
i.n.c.AbstractChannel$AnnotatedSocketException: Resource temporarily unavailable: localhost/0:0:0:0:0:0:0:1:8080  530 (39,26%)
j.i.IOException: Premature close 247 (18,30%)
Based on the wrk results, I believe my server can handle 15,000 requests/s, but with Gatling that doesn't seem to be the case. Do you have an idea why there is such a difference?
Disclaimer: Gatling's creator here
You're comparing apples and oranges.
With wrk, you're opening 10 connections and looping as fast as possible during 30s.
With your current Gatling setup, you're spawning 225,015 virtual users ((1 + 15,000) / 2 * 30), each one trying to open its own connection.
I recommend reading this article about picking injection profiles that make sense for your use case.
If you really want to do the same thing as wrk here, you need to wrap your scenario in a during(30) loop and change your injection profile to atOnceUsers(10).
You also have the option of using a shared connection pool.
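For illustration, a rough sketch of what that could look like with Gatling 3's Scala DSL (the simulation name and headers are assumptions, not a drop-in for your exact setup):
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class WrkLikeSimulation extends Simulation {

  // one shared connection pool, similar to wrk's fixed set of connections
  val httpProtocol = http
    .baseUrl("http://localhost:8080")
    .header("Accept-Encoding", "gzip")
    .shareConnections

  // each virtual user loops as fast as possible for 30 seconds
  val scn = scenario("wrk-like")
    .during(30.seconds) {
      exec(http("myService").get("/myService"))
    }

  // 10 users started at once, roughly matching wrk's -c10
  setUp(scn.inject(atOnceUsers(10))).protocols(httpProtocol)
}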
Then, you can't expect any other load testing tool to be as fast as wrk for this kind of logic-less, static test.
Also note that:
there was a mistake in Gatling's JVM configuration, fixed in Gatling 3.4.0, that hurt performance in this kind of minimalistic, super-high-throughput test; see the issue
Gatling runs on a JVM, hence with a runtime that needs to warm up, so throughput right after boot will be lower than warm throughput
I am using Flink to read data from Apache Pulsar.
I have a partitioned topic in pulsar with 8 partitions.
I produced 1000 messages in this topic, distributed across the 8 partitions.
I have 8 cores in my laptop, so I have 8 sub-tasks (by default parallelism = # of cores).
I opened the Flink UI after executing the code from Eclipse and found that some sub-tasks are not receiving any records (they are idle).
I expected all 8 sub-tasks to be utilized (each sub-task mapped to one partition of my topic).
After restarting the job, I found that sometimes 3 sub-tasks are utilized and sometimes 4, while the remaining sub-tasks stay idle.
Could you please help me clarify this scenario?
Also, how can I know whether or not there is a shuffle between sub-tasks?
My Code:
ConsumerConfigurationData<String> consumerConfigurationData = new ConsumerConfigurationData<>();
Set<String> topicsSet = new HashSet<>();
topicsSet.add("flink-08");
consumerConfigurationData.setTopicNames(topicsSet);
consumerConfigurationData.setSubscriptionName("my-sub0111");
consumerConfigurationData.setSubscriptionType(SubscriptionType.Key_Shared);
consumerConfigurationData.setConsumerName("consumer-01");
consumerConfigurationData.setSubscriptionInitialPosition(SubscriptionInitialPosition.Earliest);
PulsarSourceBuilder<String> builder = PulsarSourceBuilder.builder(new SimpleStringSchema()).pulsarAllConsumerConf(consumerConfigurationData).serviceUrl("pulsar://localhost:6650");
SourceFunction<String> src = builder.build();
DataStream<String> stream = env.addSource(src);
stream.print(" >>> ");
For the Pulsar question, I don't know enough to help. I recommend setting up a larger test and seeing how that turns out. Usually you'd have more partitions than slots, with some slots consuming several partitions in a somewhat random fashion.
Also, how can I know whether or not there is a shuffle between sub-tasks?
The easiest way is to look at the topology in the Flink Web UI. There you should see the number of tasks and the channel types. You could post a screenshot if you want more details, but in this case there is nothing that will be shuffled, since you only have a source and a sink.
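For illustration only (a hypothetical sketch using the Scala API, assuming a DataStream[String] called stream as in your example): inserting a keyBy between operators forces a hash repartitioning, which shows up as a HASH edge between the tasks in the Web UI topology.
import org.apache.flink.streaming.api.scala._

stream
  .keyBy(msg => msg)        // records are redistributed across sub-tasks by key hash -> HASH edge in the UI
  .map(msg => msg.length)   // this operator's sub-tasks receive the shuffled records
  .print()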
I have a very simple setup: a 4-node Flink cluster where one node is the JobManager and the others are TaskManagers, started by the start-cluster script.
All TaskManagers have the same configuration; regarding state and checkpointing, it's as follows:
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///root/flink-1.3.1/checkpoints/fs
state.backend.rocksdb.checkpointdir: file:///root/flink-1.3.1/checkpoints/rocksdb
# state.checkpoints.dir: file:///root/flink-1.3.1/checkpoints/metadata
# state.checkpoints.num-retained: 2
(The latter two options are commented out intentionally; I tried uncommenting them and it didn't change a thing.)
And in code I have:
import scala.concurrent.duration._  // for the .minutes / .minute syntax below

val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
streamEnv.enableCheckpointing(10.minutes.toMillis)
streamEnv.getCheckpointConfig.setCheckpointTimeout(1.minute.toMillis)
streamEnv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
After the job has been running for 40 minutes, in the directory
/root/flink-1.3.1/checkpoints/fs/.../
I see 4 checkpoint directories with the name pattern "chk-" + index, whereas I expected old checkpoints to be deleted and only one checkpoint to remain (according to the docs, only one checkpoint should be retained by default). Meanwhile, in the web UI Flink marks the first three checkpoints as "discarded".
Did I configure anything wrong, or is this expected behaviour?
The deletion is done by the JobManager, which probably has no way of accessing your files (in /root).
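A minimal sketch of the fix under that assumption: point the checkpoint directories at a filesystem that both the JobManager and all TaskManagers can reach (the HDFS paths below are hypothetical; a shared NFS mount would work the same way):
state.backend: rocksdb
# must be reachable from the JobManager as well, so it can clean up old checkpoints
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints/fs
state.checkpoints.dir: hdfs:///flink/checkpoints/metadata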
I have a 3-node SolrCloud setup (replication factor 3) running Solr 6.0 on Ubuntu 14.04 with SSDs. A lot of indexing is taking place, with only soft commits. After some time, indexing speed becomes really slow, but when I restart the Solr service on the node that became slow, everything goes back to normal. The problem is that I need to guess which node has become slow.
I have 5 collections, but only one collection (the most heavily used one) is getting slow. Total data size is 144 GB including tlogs.
Said core/collection is 99 GB including tlogs; the tlog itself is just 313 MB. Heap size is 16 GB, total memory is 32 GB, and data is stored on SSD. Every node is configured the same.
What appears strange is that I have literally hundreds or thousands of log lines per second on both slaves when this hits:
2016-09-16 10:00:30.476 INFO (qtp1190524793-46733) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[ka2PZAqO_ (1545622027473256450)]} 0 0
2016-09-16 10:00:30.477 INFO (qtp1190524793-46767) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[nlFpoYNt_ (1545622027474305024)]} 0 0
2016-09-16 10:00:30.477 INFO (qtp1190524793-46766) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[tclMjXH6_ (1545622027474305025), 98OPJ3EJ_ (1545622027476402176)]} 0 0
2016-09-16 10:00:30.478 INFO (qtp1190524793-46668) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[btceXK4M_ (1545622027475353600)]} 0 0
2016-09-16 10:00:30.479 INFO (qtp1190524793-46799) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[3ndK3HzB_ (1545622027476402177), riCqrwPE_ (1545622027477450753)]} 0 1
2016-09-16 10:00:30.479 INFO (qtp1190524793-46820) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[wr5k3mfk_ (1545622027477450752)]} 0 0
In this case 192.168.0.3 is the master.
My workflow is that I insert batches of 2,500 docs with ~10 threads at the same time, which works perfectly fine most of the time, but sometimes it becomes slow as described. Occasionally there are updates / indexing calls from other sources, but they account for less than a percent.
UPDATE
Complete config (output from Config API) is http://pastebin.com/GtUdGPLG
UPDATE 2
These are the command line args:
-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=192.168.0.1
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j.configuration=file:/var/solr/log4j.properties
-Dsolr.install.dir=/opt/solr
-Dsolr.solr.home=/var/solr/data
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=192.168.0.1:2181,192.168.0.2:2181,192.168.0.3:2181
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/var/solr/logs/solr_gc.log
-Xms16G
-Xmx16G
-Xss256k
-verbose:gc
UPDATE 3
Happened again, these are some Sematext Graphs:
Sematext Dashboard for Master:
Sematext Dashboard for Secondary 1:
Sematext Dashboard for Secondary 2:
Sematext GC for Master:
Sematext GC for Secondary 1:
Sematext GC for Secondary 2:
UPDATE 4 (2018-01-10)
This is quite an old question, but I recently discovered that someone had installed a cryptocoin miner on all of my Solr machines via CVE-2017-12629, which I fixed by upgrading to 6.6.2.
If you're not sure whether your system is compromised, check the processes running as the solr user with ps aux | grep solr. If you see two or more processes, especially a non-Java process, you might be running a miner.
So you're seeing disk I/O hitting 100% during indexing with a high-write throughput application.
There are two major drivers of disk I/O with Solr indexing:
Flushing in-memory index segments to disk.
Merging disk segments into new larger segments.
If your indexer isn't directly calling commit as a part of the indexing process (and you should make sure it isn't), Solr will flush index segments to disk based on your current settings:
Every time your RAM buffer fills up ("ramBufferSizeMB":100.0)
Based on your 3 min hard commit policy ("maxTime":180000)
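In solrconfig.xml terms, those two flush triggers presumably correspond to something like the following, reconstructed from the values quoted above (ramBufferSizeMB lives in indexConfig, autoCommit in updateHandler; the openSearcher setting is an assumption):
<ramBufferSizeMB>100</ramBufferSizeMB>
<autoCommit>
  <maxTime>180000</maxTime>           <!-- the 3 min hard commit -->
  <openSearcher>false</openSearcher>  <!-- assumed: a hard commit should not open a new searcher -->
</autoCommit>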
If your indexer isn't directly calling optimize as a part of the indexing process (and you should make sure it isn't), Solr will periodically merge index segments on disk based on your current settings (the default merge policy):
mergeFactor: 10, or roughly each time the number of on-disk index segments exceeds 10.
Based on the way you've described your indexing process:
2500 doc batches per thread x 10 parallel threads
... you could probably get away with a larger RAM buffer, to yield larger initial index segments (that are then flushed to disk less frequently).
However the fact that your indexing process
works perfectly fine for most of the time but sometimes it becomes slow
... makes me wonder if you're just seeing the effect of a large merge happening in the background, and cannibalizing system resources needed for fast indexing at that moment.
Ideas
You could experiment with a larger mergeFactor (e.g. 25). This will reduce the frequency of background index segment merges, but not the resource drain when they happen. (Also, be aware that more index segments often translates to worse query performance).
In the indexConfig, you can try overriding the default settings for the ConcurrentMergeScheduler to throttle the number of merges that can be running at one time (maxMergeCount), and/or throttle the number of threads that can be used for merges (maxThreadCount), based on the system resources you're willing to make available.
You could increase your ramBufferSizeMB. This will reduce the frequency of in-memory index segments being flushed to disk, also serving to slow down the merge cadence (a rough solrconfig.xml sketch of these knobs follows after this list).
If you are not relying on Solr for durability, you'll want /var/solr/data pointing to a local SSD volume. If you're going over a network mount (this has been documented with Amazon's EBS), there is a significant write throughput penalty, up to 10x less than writing to ephemeral/local storage.
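For illustration only, a rough sketch of how those knobs might look in the indexConfig section of solrconfig.xml (the numbers are placeholders to experiment with, not recommendations for your hardware; in Solr 6 the legacy mergeFactor maps onto the TieredMergePolicy parameters shown here):
<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">25</int>
    <int name="segmentsPerTier">25</int>
  </mergePolicyFactory>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>   <!-- merges that may be queued or running at once -->
    <int name="maxThreadCount">1</int>  <!-- threads actually merging concurrently -->
  </mergeScheduler>
</indexConfig>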
Do you have the CPU load of each core of the master, and not only the combined CPU graph? What I noticed when indexing with Solr is that when Xmx is too small (which could be the case if you have 144 GB of data and Xmx=16GB), merging takes more and more time as indexing progresses.
During merging, typically one core is at 100% CPU while the other cores do nothing.
Your master's combined CPU graph looks like that: only about 20% combined load during those phases.
So, check that the merge factor is a reasonable value (between 10 and 20 or something) and potentially raise Xmx.
Those are the two things I would play with to start.
Question: you don't have anything special in your analyzers (custom tokenizer, etc.)?
We recently migrated from Solr 3.1 to Solr 3.5. We have one master and one slave configured. The master has two cores:
1) Core1 – 44555972 documents
2) Core2 – 29419244 documents
We commit every 5,000 documents, but lately commits are taking very long, 15+ minutes in some cases. What could have caused this? I have checked the logs, and the only warning I can see is:
“WARNING: Use of deprecated update request parameter update.processor detected. Please use the new parameter update.chain instead, as support for update.processor will be removed in a later version.”
Memory details:
export JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx36g -XX:MaxPermSize=5g"
Solr Config:
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I also noticed that the top command shows almost 350 GB of virtual memory usage.
What could be causing this, as everything was running fine a few days back?
Do you have a large search warming query? Our commits take up to 2 minutes because of the searcher warming we have in place. I'm wondering if that is the case here.
The large virtual memory usage would explain this.
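If you want to check, warming queries live in solrconfig.xml as newSearcher / firstSearcher listeners (and as autowarmCount on the caches). A purely hypothetical example of the kind of warming query that can make every searcher-opening commit expensive:
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- hypothetical: a match-all query with faceting, run every time a new searcher opens -->
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>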