flink task manager running into OOM issue intermittently - apache-flink

I have 7 task managers running a streaming application with the following configuration.
state.backend.rocksdb.memory.managed: false
taskmanager.memory.process.size: 2560MB
taskmanager.memory.framework.heap.size: 128MB
taskmanager.memory.task.heap.size: 1024MB
taskmanager.memory.managed.size: 128MB
taskmanager.memory.framework.off-heap.size: 128MB
taskmanager.memory.task.off-heap.size: 256MB
taskmanager.memory.network.min: 128MB
taskmanager.memory.network.max: 128MB
taskmanager.memory.jvm-overhead.max: 256MB
taskmanager.memory.jvm-overhead.min: 256MB
taskmanager.memory.jvm-metaspace.size: 256MB
taskmanager.cpu.cores: 2
taskmanager.heap.mb: ""
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
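For reference, here is a quick tally of the memory components configured above, as a minimal Python sketch; the numbers are copied from the configuration, and Flink's own accounting should be treated as authoritative:

# Rough tally of the TaskManager memory budget above (all values in MB).
components = {
    "framework.heap":     128,
    "task.heap":          1024,
    "managed":            128,
    "framework.off-heap": 128,
    "task.off-heap":      256,
    "network":            128,   # min == max
    "jvm-metaspace":      256,
    "jvm-overhead":       256,   # min == max
}
total = sum(components.values())
print(f"sum of components: {total} MB")   # 2304 MB
print("process.size:      2560 MB")       # the components do not add up to this
# Note: with state.backend.rocksdb.memory.managed set to false, RocksDB's native
# allocations (block cache, memtables, index/filter blocks) are not capped by the
# managed-memory budget above, so the container's memory use can grow past what it
# is allowed even while JVM heap and non-heap metrics stay below their maximums.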
The TaskManager that reads the unbounded source is throwing OOM errors intermittently. Sometimes it processes fine even with 2M records, and sometimes it throws the error for as few as 30K records. Heap and non-heap memory are always under the max.
GC stats:
[7526.694s][info ][gc,start ] GC(113) Pause Young (Normal) (G1 Evacuation Pause)
[7526.694s][info ][gc,task ] GC(113) Using 9 workers of 9 for evacuation
[7526.708s][info ][gc,phases ] GC(113) Pre Evacuate Collection Set: 0.1ms
[7526.708s][info ][gc,phases ] GC(113) Evacuate Collection Set: 11.5ms
[7526.708s][info ][gc,phases ] GC(113) Post Evacuate Collection Set: 2.0ms
[7526.708s][info ][gc,phases ] GC(113) Other: 0.4ms
[7526.708s][info ][gc,heap ] GC(113) Eden regions: 594->0(594)
[7526.708s][info ][gc,heap ] GC(113) Survivor regions: 20->20(77)
[7526.708s][info ][gc,heap ] GC(113) Old regions: 135->135
[7526.708s][info ][gc,heap ] GC(113) Humongous regions: 49->49
[7526.708s][info ][gc,metaspace ] GC(113) Metaspace: 173013K(180072K)->173013K(180072K) NonClass: 151874K(157032K)->151874K(157032K) Class: 21138K(23040K)->21138K(23040K)
[7526.708s][info ][gc ] GC(113) Pause Young (Normal) (G1 Evacuation Pause) 797M->202M(1024M) 14.074ms
I couldn't figure out what is causing the OOM error.
The container status is Terminated at Jan 19, 2023 11:03:23 PM with exit code 137 (OOMKilled).
Thanks for the help.

Related

My Postgres master pod takes the maximum CPU cores, which crashes all the dependent pods

I'm running my backend, and the Postgres master pod is given 25 cores:
resources:
  limits:
    cpu: 25000m
    memory: 140Gi
  requests:
    cpu: 25000m
After around 3-4 hours of running, the Postgres master pod starts to take more and more CPU until it reaches the maximum.
Here are my Postgres settings:
effective_cache_size: "105GB"
effective_io_concurrency: "200"
listen_addresses: '*'
log_destination: "stderr"
logging_collector: "false"
log_line_prefix: '%t [%p]: [%l-1] [trx_id=%x] user=%u,db=%d'
log_min_error_statement: "DEBUG1"
log_error_verbosity: "verbose"
maintenance_work_mem: "2GB"
max_connections: "4000"
max_wal_size: "16GB"
min_wal_size: "4GB"
max_worker_processes: "18"
max_parallel_workers_per_gather: "9"
max_parallel_workers: "18"
max_parallel_maintenance_workers: "4"
random_page_cost: "1.1"
shared_buffers: "35GB"
shared_preload_libraries: "pg_stat_statements"
pg_stat_statements.track: "all"
log_statement: "all"
synchronous_commit: "false"
syslog_facility: "LOCAL0"
syslog_ident: "postgres"
syslog_sequence_numbers: "true"
syslog_split_messages: "true"
wal_buffers: "16MB"
work_mem: "500MB"
checkpoint_completion_target: "0.9"
idle_in_transaction_session_timeout: "20000"
statement_timeout: "60000"
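As a back-of-envelope illustration of how the settings above interact (a sketch using the configured values; the concurrency figure is hypothetical and this is not a diagnosis):

# Rough look at the settings above; work_mem is a per-sort/per-hash limit,
# so a single query can use several multiples of it, and each parallel
# worker gets its own allowance.
work_mem_mb        = 500
shared_buffers_gb  = 35
pod_memory_gb      = 140
cpu_limit_cores    = 25
workers_per_gather = 9

active_sorts = 100   # hypothetical number of concurrent work_mem-sized sorts
print(f"~{active_sorts * work_mem_mb / 1024:.0f} GB of work_mem on top of "
      f"{shared_buffers_gb} GB shared_buffers (pod limit: {pod_memory_gb} GB)")

# CPU side: one leader plus up to max_parallel_workers_per_gather workers per
# query, so a handful of parallel queries can saturate the 25-core limit.
print(f"each parallel query can use up to {workers_per_gather + 1} cores; "
      f"{cpu_limit_cores / (workers_per_gather + 1):.1f} such queries fill all "
      f"{cpu_limit_cores} cores")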
Also, when checking the number of idle/active processes, we get:
Any idea how to fix this? What might cause this high Postgres CPU usage?
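For reference, the idle/active breakdown mentioned above is usually taken from pg_stat_activity; here is a minimal sketch of such a check (the connection details and exact query are assumptions, not the check the poster ran):

import psycopg2

# Hypothetical connection parameters; point this at the Postgres master pod.
conn = psycopg2.connect("host=localhost port=5432 dbname=postgres user=postgres")

with conn, conn.cursor() as cur:
    # Count backends by state: 'active', 'idle', 'idle in transaction', ...
    cur.execute("""
        SELECT state, count(*)
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY count(*) DESC
    """)
    for state, count in cur.fetchall():
        print(f"{state or 'background worker'}: {count}")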

Does anyone know why the Azure Kinect DK crashes in the viewer when WFOV (wide field of view) is used as the depth configuration?

When the WFOV (wide field of view) depth configuration is used, the Azure Kinect fails in the Kinect Viewer. We tried to update the firmware, but it did not help. Every time, a couple of seconds after opening the viewer, we receive the following errors as it crashes:
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 307733 type:Color
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 374388 type:Color
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 441066 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 507733 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 574411 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 641077 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 707722 type:Color
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 774400 type:Color
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 841055 type:Color
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 907733 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 974388 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 1041066 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 1107733 type:Color
[ warning ] : usb_cmd_libusb_cb(). USB timeout on streaming endpoint for depth
[ error ] : replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 1174411 type:Color
[ error ] : cmd_status == CMD_STATUS_PASS returned failure in depthmcu_depth_stop_streaming()
[ error ] : depthmcu_depth_stop_streaming(). ERROR: cmd_status=0x00000063
[ error ] : cmd_status == CMD_STATUS_PASS returned failure in depthmcu_depth_stop_streaming()
[ error ] : depthmcu_depth_stop_streaming(). ERROR: cmd_status=0x00000063
Does anyone know why we are getting these errors in the device's own viewer?
Thank you

Apache Solr : Data import handler exception - SetGraphicsStateParameters name for 'gs' operator not found in resources: /R7

I configured the data import handler to process bulk PDF documents. After processing 21000 documents, the process goes idle and does not process the remaining documents.
When I looked at the log, I observed the following.
Please let me know if there is any way I can ignore this issue, or whether there is any setting I need to update.
Error:
2020-04-23 18:39:55.749 INFO (qtp215219944-24) [ x:DMS] o.a.s.c.S.Request [DMS] webapp=/solr path=/dataimport params={indent=on&wt=json&command=status&_=1587664092295} status=0 QTime=0
2020-04-23 18:39:55.972 WARN (Thread-14) [ ] o.a.p.p.COSParser **The end of the stream is out of range, using workaround to read the stream, stream start position: 4748210, length: 2007324, expected end position: 6755534**
2020-04-23 18:39:55.976 WARN (Thread-14) [ ] o.a.p.p.COSParser Removed null object COSObject{50, 0} from pages dictionary
2020-04-23 18:39:55.976 WARN (Thread-14) [ ] o.a.p.p.COSParser Removed null object COSObject{60, 0} from pages dictionary
2020-04-23 18:39:55.997 ERROR (Thread-14) [ ] o.a.p.c.o.s.SetGraphicsStateParameters **name for 'gs' operator not found in resources: /R7**
No Unicode mapping for 198 (1) in font DDJQSL+Wingdings
Regards,
Ravi kumar
After analyzing the documents, I observed that some of them are larger than 500 MB.
So Solr was running into an out-of-memory exception, and the heap memory needed to be increased.
After doing that, the issue was resolved.

Can't start clickhouse service, too many files in ../data/default/<TableName>

I have a strange problem with my standalone clickhouse-server installation. The server was running for some time with a nearly default config, except that the data and tmp directories were moved to a separate disk:
cat /etc/clickhouse-server/config.d/my_config.xml
<?xml version="1.0"?>
<yandex>
    <path>/data/clickhouse/</path>
    <tmp_path>/data/clickhouse/tmp/</tmp_path>
</yandex>
Today the server stopped responding with a connection refused error. It was rebooted, and after that the service couldn't completely start:
2018.05.28 13:15:44.248373 [ 2 ] <Information> DatabaseOrdinary (default): 42.86%
2018.05.28 13:15:44.259860 [ 2 ] <Debug> default.event_4648 (Data): Loading data parts
2018.05.28 13:16:02.531851 [ 2 ] <Debug> default.event_4648 (Data): Loaded data parts (2168 items)
2018.05.28 13:16:02.532130 [ 2 ] <Information> DatabaseOrdinary (default): 57.14%
2018.05.28 13:16:02.534622 [ 2 ] <Debug> default.event_5156 (Data): Loading data parts
2018.05.28 13:34:01.731053 [ 3 ] <Information> Application: Received termination signal (Terminated)
Actually, I stopped the process at 57% because it was taking too long to start (maybe it would have finished in an hour or two, I didn't try).
The default log level is "trace", but it didn't show any reason for this behavior.
I think the problem is the file count in /data/clickhouse/data/default/event_5156.
There are now 626023 directories in it, and ls -la does not work properly in this directory; I have to use find to count the files:
# time find . -maxdepth 1 | wc -l
626023
real 5m0.302s
user 0m3.114s
sys 0m24.848s
I have two questions:
1) Why did clickhouse-server generate so many files and directories with the default config?
2) How can I start the service without data loss in a reasonable time?
The issue was in the data update method. I used a script with the JDBC connector and was sending one row per request, so every insert created a new data part on disk. After changing the scheme to batch updates, the issue was solved.
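For illustration, here is the batching idea sketched in Python with the clickhouse-driver client (the original script used a JDBC connector, and the column names below are hypothetical):

from clickhouse_driver import Client

client = Client("localhost")   # ClickHouse server address

# Hypothetical schema for default.event_5156; the real columns will differ.
rows = [
    ("2018-05-28", 1, "click"),
    ("2018-05-28", 2, "view"),
    # ... accumulate thousands of rows in memory before flushing
]

# One INSERT with many rows creates a single data part (directory) on disk,
# whereas one INSERT per row creates one part per row, which is what led to
# the 626023 directories in data/default/event_5156.
client.execute(
    "INSERT INTO default.event_5156 (event_date, event_id, event_type) VALUES",
    rows,
)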

MongoDB shard balancing does not work properly, with a lot of moveChunk errors reported

We have a MongoDB cluster with 3 shards; each shard is a replica set containing 3 nodes, and the MongoDB version we use is 3.2.6. We have a big database of about 230 GB, which contains about 5500 collections. We found that about 2300 collections are not balanced, while the other 3200 collections are evenly distributed across the 3 shards.
Below is the result of sh.status() (the whole result is too big, so I am only posting part of it):
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("57557345fa5a196a00b7c77a")
}
shards:
{ "_id" : "shard1", "host" : "shard1/10.25.8.151:27018,10.25.8.159:27018" }
{ "_id" : "shard2", "host" : "shard2/10.25.2.6:27018,10.25.8.178:27018" }
{ "_id" : "shard3", "host" : "shard3/10.25.2.19:27018,10.47.102.176:27018" }
active mongoses:
"3.2.6" : 1
balancer:
Currently enabled: yes
Currently running: yes
Balancer lock taken at Sat Sep 03 2016 09:58:58 GMT+0800 (CST) by iZ23vbzyrjiZ:27017:1467949335:-2109714153:Balancer
Collections with active migrations:
bdtt.normal_20131017 started at Sun Sep 18 2016 17:03:11 GMT+0800 (CST)
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
1490 : Failed with error 'aborted', from shard2 to shard3
1490 : Failed with error 'aborted', from shard2 to shard1
14 : Failed with error 'data transfer error', from shard2 to shard1
databases:
{ "_id" : "bdtt", "primary" : "shard2", "partitioned" : true }
bdtt.normal_20160908
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 142
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160909
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 36
shard2 42
shard3 46
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160910
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 34
shard2 32
shard3 32
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160911
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard1 30
shard2 32
shard3 32
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160912
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 126
too many chunks to print, use verbose if you want to force print
bdtt.normal_20160913
shard key: { "_id" : "hashed" }
unique: false
balancing: true
chunks:
shard2 118
too many chunks to print, use verbose if you want to force print
}
Collection "normal_20160913" is not balanced, I post the getShardDistribution() result of this collection below:
mongos> db.normal_20160913.getShardDistribution()
Shard shard2 at shard2/10.25.2.6:27018,10.25.8.178:27018
data : 4.77GiB docs : 203776 chunks : 118
estimated data per chunk : 41.43MiB
estimated docs per chunk : 1726
Totals
data : 4.77GiB docs : 203776 chunks : 118
Shard shard2 contains 100% data, 100% docs in cluster, avg obj size on shard : 24KiB
The balancer process is running, and the chunk size is the default (64 MB):
mongos> sh.isBalancerRunning()
true
mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "chunksize", "value" : NumberLong(64) }
{ "_id" : "balancer", "stopped" : false }
And I found a lot of moveChunk errors in the mongos log, which might be the reason why some of the collections are not well balanced; here is the latest part of them:
2016-09-19T14:25:25.427+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.620+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.644+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.701+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.728+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.232+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.256+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.101+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.112+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:43:41.889+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
I tried using the moveChunk command manually, and it returns the same error:
mongos> sh.moveChunk("bdtt.normal_20160913", {_id:ObjectId("57d6d107edac9244b6048e65")}, "shard3")
{
"cause" : {
"ok" : 0,
"errmsg" : "Not starting chunk migration because another migration is already in progress",
"code" : 117
},
"code" : 117,
"ok" : 0,
"errmsg" : "move failed"
}
I am not sure whether the migrations are overwhelmed because too many collections were created; about 60-80 new collections are created each day.
I need help answering the questions below; any hints would be great:
Why are some of the collections not balanced? Is it related to the large number of newly created collections?
Is there any command to check the details of in-progress migration jobs? I get a lot of error logs showing that some migration job is running, but I cannot find which one it is.
Answering my own question:
We finally found the root cause: it is exactly the same issue as "MongoDB balancer timeout with delayed replica", caused by an abnormal replica set configuration.
When this issue happened, our replica set configuration was as follows:
shard1:PRIMARY> rs.conf()
{
"_id" : "shard1",
"version" : 3,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 0,
"host" : "10.25.8.151:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 1,
"host" : "10.25.8.159:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "10.25.2.6:37018",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 3,
"host" : "10.47.114.174:27018",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : true,
"priority" : 0,
"tags" : {
},
"slaveDelay" : NumberLong(86400),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5755464f789c6cd79746ad62")
}
}
There are 4 nodes in the replica set: one primary, one slave, one arbiter, and one slave delayed by 24 hours. That makes 3 nodes the majority; since the arbiter holds no data, the balancer has to wait for the delayed slave to satisfy the write concern (to make sure the receiving shard has received the chunk).
There are several ways to solve the problem; we just removed the arbiter, and the balancer works fine now.
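For illustration, here is a sketch of that reconfiguration with pymongo, run against the primary of the affected shard (the mongo shell helper rs.remove() achieves the same; the host below is taken from the config above and everything else is an assumption):

from pymongo import MongoClient

# Connect directly to whichever member of the shard replica set is primary.
client = MongoClient("mongodb://10.25.8.151:27018")

conf = client.admin.command("replSetGetConfig")["config"]

# Drop the arbiter member and bump the config version before reapplying.
conf["members"] = [m for m in conf["members"] if not m.get("arbiterOnly")]
conf["version"] += 1

client.admin.command("replSetReconfig", conf)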
I'm going to speculate, but my guess is that your collections are very imbalanced and are currently being balanced by chunk migration (which might take a long time). Hence your manual chunk migration is queued but not executed right away.
Here are a few points that might clarify a bit more:
One chunk at a time: MongoDB chunk migration happens through a queue mechanism, and only one chunk is migrated at a time.
Balancer lock: The balancer lock information might give you some more idea of what is being migrated. You should also be able to see chunk migration log entries in your mongos log files.
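A minimal sketch of reading that lock information on MongoDB 3.2 through pymongo, connected to a mongos (the address is an assumption):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # replace with your mongos
locks = client["config"]["locks"]

# The balancer's own distributed lock.
print(locks.find_one({"_id": "balancer"}))

# Collection locks currently held (state 2 = locked) show which namespace is
# being migrated, by which process, and why.
for lock in locks.find({"state": 2}):
    print(lock["_id"], lock.get("who"), lock.get("why"))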
One option you have is to do some pre-splitting of your collections. Pre-splitting essentially configures an empty collection to start out balanced and avoid becoming imbalanced in the first place, because once collections get imbalanced, the chunk migration process might not be your friend.
Also, you might want to revisit your shard keys. You are probably doing something wrong with your shard keys that's causing a lot of imbalance.
Plus, your data size doesn't seem large enough to me to warrant a sharded configuration. Remember, never go with a sharded configuration unless you are forced to by your data size/working set size. Sharding is not free (you are probably already feeling the pain).
