I'm running a performance test against Vespa, and the container seems unable to keep up with the incoming requests. Looking at vespa.log, there are lots of GC allocation failure entries. However, system resource usage is pretty low (CPU < 30%, memory < 35%). Is there any configuration I can optimize?
Btw, it looks like the docprocservice runs on the content nodes by default - how do I tune jvmargs for docprocservice?
1523361302.261056 24298 container stdout info [GC (Allocation Failure) 3681916K->319796K(7969216K), 0.0521448 secs]
1523361302.772183 24301 docprocservice stdout info [GC (Allocation Failure) 729622K->100400K(1494272K), 0.0058702 secs]
1523361306.478681 24301 docprocservice stdout info [GC (Allocation Failure) 729648K->99337K(1494272K), 0.0071413 secs]
1523361308.275909 24298 container stdout info [GC (Allocation Failure) 3675316K->325043K(7969216K), 0.0669859 secs]
1523361309.798619 24301 docprocservice stdout info [GC (Allocation Failure) 728585K->100538K(1494272K), 0.0060528 secs]
1523361313.530767 24301 docprocservice stdout info [GC (Allocation Failure) 729786K->100561K(1494272K), 0.0088941 secs]
1523361314.549254 24298 container stdout info [GC (Allocation Failure) 3680563K->330211K(7969216K), 0.0531680 secs]
1523361317.571889 24301 docprocservice stdout info [GC (Allocation Failure) 729809K->100551K(1494272K), 0.0062653 secs]
1523361320.736348 24298 container stdout info [GC (Allocation Failure) 3685729K->316908K(7969216K), 0.0595787 secs]
1523361320.839502 24301 docprocservice stdout info [GC (Allocation Failure) 729799K->99311K(1494272K), 0.0069882 secs]
1523361324.948995 24301 docprocservice stdout info [GC (Allocation Failure) 728559K->99139K(1494272K), 0.0127939 secs]
services.xml:
<container id="container" version="1.0">
  <config name="container.handler.threadpool">
    <maxthreads>10000</maxthreads>
  </config>
  <config name="config.docproc.docproc">
    <numthreads>500</numthreads>
  </config>
  <config name="search.config.qr-start">
    <jvm>
      <heapSizeAsPercentageOfPhysicalMemory>60</heapSizeAsPercentageOfPhysicalMemory>
    </jvm>
  </config>
  <document-api />
  <search>
    <provider id="music" cluster="music" cachesize="64M" type="local" />
  </search>
  <nodes>
    <node hostalias="admin0" />
    <node hostalias="node2" />
  </nodes>
</container>
# free -lh
total used free shared buff/cache available
Mem: 125G 43G 18G 177M 63G 80G
Low: 125G 106G 18G
High: 0B 0B 0B
Swap: 0B 0B 0B
Those GC messages come from the JVM and are normal, not real failures. It's just the way the JVM works, collecting garbage that the application creates, and all of those are minor collections in the young generation. Only if you start seeing Full GC messages would tuning be required.
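A quick way to check for that, assuming the GC lines end up in the default vespa.log location under $VESPA_HOME (adjust the path if your installation differs):

grep "Full GC" $VESPA_HOME/logs/vespa/vespa.log | tail

If that returns nothing, the collector is keeping up and the allocation-failure lines can be ignored.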
The 'docprocservice' is not involved in search serving either, so you can safely ignore those for a serving test. Most likely your bottleneck is the underlying content layer. What is the resource usage like there?
Regardless, running with 10K maxthreads seems excessive; the default 500 is more than enough - what kind of benchmarking client are you using?
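In other words, you can either drop the threadpool override from services.xml entirely or set it back to the default value, e.g.:

<config name="container.handler.threadpool">
  <maxthreads>500</maxthreads>
</config>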
Generally it's easier to help if you provide:
The setup and HW configuration (e.g. services.xml and the document schema)
What type of queries and ranking profile are in use, which fields are searched, etc. Also the total number of documents, and if you use a custom ranking profile, how the results compare with using the built-in 'unranked' ranking profile.
Average number of hits returned (the &hits=x parameter) and average total hits
Resource usage (e.g. vmstat/top/network utilization) from the container(s) and content node(s) when the latency starts climbing past your targeted latency SLA (bottleneck reached/max throughput)
Same as above but with only one client (no concurrency). If you are already past your targeted latency SLA/expectation with no concurrency, you might have to review the features in use (examples would be adding rank:filter to unranked fields, adding fast-search to attributes involved in the query, and so on)
Benchmarking client used (e.g. number of connections and parameters used). We usually use the vespa-fbench tool; a sample invocation is shown below.
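For reference, a typical vespa-fbench run looks something like the following; the client count, run time, query file and hostname are just example values:

vespa-fbench -n 32 -c 0 -s 300 -q queries.txt mycontainerhost 8080

This runs 32 concurrent clients with no artificial delay between queries (-c 0) for 300 seconds against the container's search port, reading queries from queries.txt.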
Some general resources on Benchmarking & Profiling Vespa:
Benchmarking Vespa (including our own benchmark client using persistent connections; if you benchmark using non-persistent connections you might end up benchmarking the OS's ability to maintain TCP connections) http://docs.vespa.ai/documentation/performance/vespa-benchmarking.html
Profiling & Sizing http://docs.vespa.ai/documentation/performance/
Feature tuning http://docs.vespa.ai/documentation/performance/feature-tuning.html
Scaling Vespa http://docs.vespa.ai/documentation/performance/sizing-search.html - this has some interesting graphs (e.g. the expected relationship between overall latency & total hits, and the expected latency breakdown once saturation has been reached).
Related
I am trying to test the following library: https://tech.scribd.com/blog/2021/introducing-sql-delta-import.html
I want to copy data from my SQL database to a data lake in the Delta format. I have created a mount point, databases, and an empty Delta table. What I am trying to do now is to run a Databricks job with the following parameters:
["--class","io.delta.connectors.spark.JDBC.ImportRunner",
"/jars/sql_delta_import_2_12_0_2_1_SNAPSHOT.jar",
"jdbc:sqlserver:/myserver.database.windows.net:1433;database=mydatabase;user=myuser;password=mypass;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30",
"sourcedb.sourcetable",
"targetdb.targettable",
"PersonID"]
What I am getting is:
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Warning: Ignoring non-Spark config property: libraryDownload.sleepIntervalSeconds
Warning: Ignoring non-Spark config property: libraryDownload.timeoutSeconds
Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds
Error: Failed to load class io.delta.connectors.spark.JDBC.ImportRunner.
Fetching the jar file was logged correctly, so it was able to find it.
21/05/07 10:08:15 INFO Utils: Fetching dbfs:/jars/sql_delta_import_2_12_0_2_1_SNAPSHOT.jar to /local_disk0/tmp/spark-76e146dd-835d-4ddf-9b3b-f32d75c3cba2/fetchFileTemp8075862365042488320.tmp
I have unpacked the jar file and the package path exists, so I am not sure what the cause could be. This is my first encounter with Scala, so I would appreciate any advice since I am a bit lost.
stdout output:
2021-05-07T13:46:40.434+0000: [GC (Allocation Failure) [PSYoungGen: 56320K->8326K(65536K)] 56320K->8342K(216064K), 0.0083761 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2021-05-07T13:46:40.884+0000: [GC (Allocation Failure) [PSYoungGen: 64646K->7553K(65536K)] 64662K->7577K(216064K), 0.0076350 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2021-05-07T13:46:41.367+0000: [GC (Allocation Failure) [PSYoungGen: 63873K->8972K(65536K)] 63897K->9004K(216064K), 0.0069414 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2021-05-07T13:46:41.422+0000: [GC (Metadata GC Threshold) [PSYoungGen: 25702K->6172K(121856K)] 25734K->6212K(272384K), 0.0058830 secs] [Times: user=0.01 sys=0.01, real=0.00 secs]
2021-05-07T13:46:41.428+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 6172K->0K(121856K)] [ParOldGen: 40K->6082K(78336K)] 6212K->6082K(200192K), [Metaspace: 20079K->20079K(1067008K)], 0.0277412 secs] [Times: user=0.06 sys=0.01, real=0.03 secs]
2021-05-07T13:46:42.235+0000: [GC (Allocation Failure) [PSYoungGen: 112640K->7697K(121856K)] 118722K->13851K(200192K), 0.0088850 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
2021-05-07T13:46:42.745+0000: [GC (Allocation Failure) [PSYoungGen: 120337K->5906K(195584K)] 126491K->12068K(273920K), 0.0112406 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
2021-05-07T13:46:42.881+0000: [GC (Metadata GC Threshold) [PSYoungGen: 28380K->4154K(197632K)] 34542K->10324K(275968K), 0.0055152 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2021-05-07T13:46:42.886+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 4154K->0K(197632K)] [ParOldGen: 6170K->9617K(121344K)] 10324K->9617K(318976K), [Metaspace: 33488K->33488K(1079296K)], 0.0902208 secs] [Times: user=0.32 sys=0.01, real=0.09 secs]
2021-05-07T13:46:44.026+0000: [GC (Allocation Failure) [PSYoungGen: 187904K->8219K(273920K)] 197521K->17845K(395264K), 0.0091905 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]
Heap
PSYoungGen total 273920K, used 195530K [0x000000073d980000, 0x0000000752280000, 0x00000007c0000000)
eden space 265216K, 70% used [0x000000073d980000,0x000000074906b950,0x000000074dc80000)
from space 8704K, 94% used [0x0000000751a00000,0x0000000752206f00,0x0000000752280000)
to space 10240K, 0% used [0x0000000750e80000,0x0000000750e80000,0x0000000751880000)
ParOldGen total 121344K, used 9625K [0x0000000638c00000, 0x0000000640280000, 0x000000073d980000)
object space 121344K, 7% used [0x0000000638c00000,0x0000000639566708,0x0000000640280000)
Metaspace used 50422K, capacity 52976K, committed 53248K, reserved 1095680K
class space used 6607K, capacity 6877K, committed 6912K, reserved 1048576K
So the actual issue was the io.delta.connectors.spark.JDBC.ImportRunner part. I had copy-pasted it from the blog, but the actual package path should be lowercase: io.delta.connectors.spark.jdbc.ImportRunner.
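For clarity, the corrected parameter list is identical to the one above except for the class name:

["--class","io.delta.connectors.spark.jdbc.ImportRunner",
"/jars/sql_delta_import_2_12_0_2_1_SNAPSHOT.jar",
"jdbc:sqlserver:/myserver.database.windows.net:1433;database=mydatabase;user=myuser;password=mypass;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30",
"sourcedb.sourcetable",
"targetdb.targettable",
"PersonID"]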
When I import a 120 GB text file into ClickHouse (it contains about 400 million rows), the import gets killed after more than 100 million rows have been loaded.
The import statement is as follows:
clickhouse-client --user default --password xxxxx --port 9000 -hbd4 --database="dbs" --input_format_allow_errors_ratio=0.1 --query="insert into ... FORMAT CSV" < /1.csv
The error is as follows:
2021.04.29 10:20:23.135790 [ 19694 ] {} <Fatal> Application: Child process was terminated by signal 9 (KILL). If it is not done by 'forcestop' command or manually, the possible cause is OOM Killer (see 'dmesg' and look at the '/var/log/kern.log' for the details).
Is the imported file too large and blowing up memory? Should I split the file into smaller pieces?
Take a look at the system logs - they should have some clues:
As suggested in the error message, run dmesg and see if there's any mention of the OOM Killer [the kernel self-protection mechanism that triggers on out-of-memory events]. If that's the case, you're either out of memory or you've granted too much memory to ClickHouse; a quick check is sketched below.
See what ClickHouse's own logs tell you. The path to the log file is defined in clickhouse-server/config.xml, under yandex/logger/log - it's likely /var/log/clickhouse-server/clickhouse-server.log plus /var/log/clickhouse-server/clickhouse-server.err.log.
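A minimal sketch of those two checks, assuming the default log locations mentioned above:

dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
tail -n 100 /var/log/clickhouse-server/clickhouse-server.err.log

If dmesg shows the clickhouse process being killed, that confirms the OOM Killer theory from the first point.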
I would like to confirm that my message has been successfully transmitted on the CAN bus using the SocketCAN library.
The SocketCAN documentation describes this possibility when using the recvmsg() function, but I am having problems implementing it.
What I want to achieve is confirmation that my message won the arbitration process.
I think by mentioning recvmsg(2) you are referring to the following paragraph of the SocketCAN docs:
MSG_CONFIRM: set when the frame was sent via the socket it is received on.
This flag can be interpreted as a 'transmission confirmation' when the
CAN driver supports the echo of frames on driver level, see 3.2 and 6.2.
In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
The key words here are "when the CAN driver supports the echo of frames on driver level", so you have to ensure that first. Next, you need to enable the corresponding flags. Finally, such confirmation has nothing to do with arbitration: when a frame loses arbitration, the controller simply tries to re-transmit it as soon as the bus becomes free.
I think you can use the command "candump can0" (or "candump can1") on your PC; it will show the CAN packets received on the given CAN interface. A sample invocation follows the usage listing below.
Usage: candump [options] <CAN interface>+
(use CTRL-C to terminate candump)
Options: -t <type> (timestamp: (a)bsolute/(d)elta/(z)ero/(A)bsolute w date)
-c (increment color mode level)
-i (binary output - may exceed 80 chars/line)
-a (enable additional ASCII output)
-b <can> (bridge mode - send received frames to <can>)
-B <can> (bridge mode - like '-b' with disabled loopback)
-u <usecs> (delay bridge forwarding by <usecs> microseconds)
-l (log CAN-frames into file. Sets '-s 2' by default)
-L (use log file format on stdout)
-n <count> (terminate after receiption of <count> CAN frames)
-r <size> (set socket receive buffer to <size>)
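For example, using only the options listed above (the interface name and frame count are just examples):

candump -t a -n 10 can0

This prints an absolute timestamp for each frame and terminates after 10 frames have been received on can0.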
For those reaching here: unfortunately I could not recover the data. After various attempts and after reproducing the problem, it was too costly to keep trying, so we just used a past backup to recreate the information we needed.
A human error broke a 150 GB UFS filesystem (Solaris).
While trying to take a backup of the filesystem (c0t0d0s3), ufsdump(1M) was not used correctly.
I will explain the background that led to this ...
The admin used:
root@ats-br000432 # ufsdump 0f /dev/dsk/c0t0d0s3 > output_1
Usage: ufsdump [0123456789fustdWwnNDCcbavloS [argument]] filesystem
This was incorrect usage, so it just created a file called output_1 with 0 bytes:
# ls -la output_1
-rw-r--r-- 1 root root 0 abr 12 14:12 output_1
Then, the syntax used was:
# ufsdump 0f /dev/rdsk/c0t0d0s3 output_1
This wrote that 0-byte file output_1 to /dev/rdsk/c0t0d0s3 - the raw partition slice. (With 0f, the argument right after f is the dump destination, so ufsdump treated the slice as its output device and output_1 as the 'filesystem' to dump.)
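For reference, if I read the man page correctly, the intended command would have been the other way around - the dump file (path here is just an example) as the f argument and the raw device as the filesystem to dump:

# ufsdump 0f /backup/output_1 /dev/rdsk/c0t0d0s3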
Now, interestingly, since output_1 was a 0-byte file, we thought this would cause no harm to the filesystem, but it did.
When trying to ls in the mountpoint, the partition reported an I/O error; after unmounting and mounting it again, the filesystem showed no contents, but the disk space still showed as used, just like before.
I assume, at some point, the filesystem 'header' was affected, right? Or was it the slice information?
A brief fsck attempt brings up this:
** /dev/rdsk/c0t0d0s3
** Last Mounted on /database
** Phase 1 - Check Blocks and Sizes
INCORRECT DISK BLOCK COUNT I=11 (400 should be 208)
CORRECT?
Disk block count / I=11
This seems to indicate that the command broke the filesystem's information about its own contents, right?
When we tried fsck -y -F ufs /dev/dsk.. various files were recovered, but not the dbf files we are after (which are GB-sized).
What can be done now? Should I try every backup superblock reported by newfs -N? (See the sketch after the newfs output below.)
EDIT: new information regarding the partition
newfs output showing superblock information:
# newfs -N /dev/rdsk/c0t0d0s3
Warning: 2826 sector(s) in last cylinder unallocated
/dev/rdsk/c0t0d0s3: 265104630 sectors in 43149 cylinders of 48 tracks, 128 sectors
129445,6MB in 2697 cyl groups (16 c/g, 48,00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
.....................................................
super-block backups for last 10 cylinder groups at:
264150944, 264241184, 264339616, 264438048, 264536480, 264634912, 264733344,
264831776, 264930208, 265028640
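For example, using the first backup superblock reported above, I assume the command would be something like the following (ideally after taking a raw copy of the slice first, so further attempts can't make things worse):

# fsck -F ufs -o b=98464 /dev/rdsk/c0t0d0s3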
I am trying to follow this blog to set up SolrCloud with Docker:
https://lucidworks.com/blog/solrcloud-on-docker/
I was able to create the ZooKeeper image successfully; the docker images command lists the image too.
However, when I try to create and run the zookeeper container with the following command, it errors out:
docker run -name zookeeper -p 2181 -p 2888 -p 3888 myusername/zookeeper:3.4.6
Error:
Warning: '-n' is deprecated, it will be removed soon. See usage.
invalid value "zookeeper" for flag -a: valid streams are STDIN, STDOUT and STDERR
See 'docker run --help'.
flag provided but not defined: -name
See 'docker run --help'.
What am I missing here?
Please use --name instead - as the 'flag provided but not defined: -name' error shows, your Docker version no longer accepts the single-dash -name form.
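Here is your command with only that flag changed:

docker run --name zookeeper -p 2181 -p 2888 -p 3888 myusername/zookeeper:3.4.6

For reference, the full docker run usage from this Docker version follows: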
Usage: docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
Run a command in a new container
-a, --attach=[] Attach to STDIN, STDOUT or STDERR
--add-host=[] Add a custom host-to-IP mapping (host:ip)
--blkio-weight=0 Block IO weight (relative weight)
-c, --cpu-shares=0 CPU shares (relative weight)
--cap-add=[] Add Linux capabilities
--cap-drop=[] Drop Linux capabilities
--cgroup-parent="" Optional parent cgroup for the container
--cidfile="" Write the container ID to the file
--cpu-period=0 Limit CPU CFS (Completely Fair Scheduler) period
--cpu-quota=0 Limit CPU CFS (Completely Fair Scheduler) quota
--cpuset-cpus="" CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems="" Memory nodes (MEMs) in which to allow execution (0-3, 0,1)
-d, --detach=false Run container in background and print container ID
--device=[] Add a host device to the container
--dns=[] Set custom DNS servers
--dns-search=[] Set custom DNS search domains
-e, --env=[] Set environment variables
--entrypoint="" Overwrite the default ENTRYPOINT of the image
--env-file=[] Read in a file of environment variables
--expose=[] Expose a port or a range of ports
--group-add=[] Add additional groups to run as
-h, --hostname="" Container host name
--help=false Print usage
-i, --interactive=false Keep STDIN open even if not attached
--ipc="" IPC namespace to use
-l, --label=[] Set metadata on the container (e.g., --label=com.example.key=value)
--label-file=[] Read in a file of labels (EOL delimited)
--link=[] Add link to another container
--log-driver="" Logging driver for container
--log-opt=[] Log driver specific options
--lxc-conf=[] Add custom lxc options
-m, --memory="" Memory limit
--mac-address="" Container MAC address (e.g. 92:d0:c6:0a:29:33)
--memory-swap="" Total memory (memory + swap), '-1' to disable swap
--memory-swappiness="" Tune a container's memory swappiness behavior. Accepts an integer between 0 and 100.
--name="" Assign a name to the container
--net="bridge" Set the Network mode for the container
--oom-kill-disable=false Whether to disable OOM Killer for the container or not
-P, --publish-all=false Publish all exposed ports to random ports
-p, --publish=[] Publish a container's port(s) to the host
--pid="" PID namespace to use
--privileged=false Give extended privileges to this container
--read-only=false Mount the container's root filesystem as read only
--restart="no" Restart policy (no, on-failure[:max-retry], always)
--rm=false Automatically remove the container when it exits
--security-opt=[] Security Options
--sig-proxy=true Proxy received signals to the process
-t, --tty=false Allocate a pseudo-TTY
-u, --user="" Username or UID (format: <name|uid>[:<group|gid>])
--ulimit=[] Ulimit options
--disable-content-trust=true Skip image verification
--uts="" UTS namespace to use
-v, --volume=[] Bind mount a volume
--volumes-from=[] Mount volumes from the specified container(s)
-w, --workdir="" Working directory inside the container