ArangoDB WARNING [3ad54] {engines} slow background settings sync - database

I have an ArangoDB 3.8.7 database running on an AWS instance that holds ~200 million records (~1000 new records per minute).
During the day, when user load is higher, I keep seeing this warning in the database logs and request responses get really slow (from the normal ~500 ms to 5-15 seconds).
WARNING [3ad54] {engines} slow background settings sync
I use a large AWS instance c5a.12xlarge (48 vCPUs) with 98 GB RAM and even AWS analysis shows my instance is over provisioned.
i-0c41xxxxxxxxxxx is over-provisioned
Compute Optimizer found that this instance's CPU, network bandwidth and network PPS are over-provisioned.
I'm running a WAL compaction task every 60 seconds. (I've tried lowering the interval to 15 seconds, and it seems to get a little worse. With a 10-minute interval it was also terrible.)
2022-11-24T14:45:35Z [1303] WARNING [3ad54] {engines} slow background settings sync: 9.240683 s
2022-11-24T14:45:49Z [1303] WARNING [3ad54] {engines} slow background settings sync: 11.222022 s
2022-11-24T14:46:05Z [1303] WARNING [3ad54] {engines} slow background settings sync: 14.198186 s
2022-11-24T14:46:18Z [1303] WARNING [3ad54] {engines} slow background settings sync: 10.272200 s
2022-11-24T14:46:34Z [1303] WARNING [3ad54] {engines} slow background settings sync: 13.703265 s
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} --------------------------
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Running compaction task...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting access...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting accounts...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting addresses...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting products...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting phones...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Compacting call_log...
2022-11-24T14:46:35Z [1303] INFO [99d80] {general} --------------------------
Is there a way to optimize this, since my instance is more than enough to handle the load? And what exactly does this warning mean?
Edit: Today I've upgraded to ArangoDB 3.10.1 and also upgraded my AWS instance to c6a.16xlarge (64 vCPUs) !!! And the problem persists.
BTW: the main issue is not the warning messages themselves; it is the lag, the data corruption/write lock errors and the huge delays that occur while these warnings are being shown.
Dec 01 01:24:31 sudo[1402]: Caused by: com.arangodb.ArangoDBException: Response: 409, Error: 1200 - AQL: timeout waiting to lock key Operation timed out: Timeout waiting to lock key; key: 12430138595 (while executing)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.util.ResponseUtils.checkError(ResponseUtils.java:55)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstCommunication.checkError(VstCommunication.java:157)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstCommunicationSync.execute(VstCommunicationSync.java:144)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstCommunicationSync.execute(VstCommunicationSync.java:45)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstCommunication.execute(VstCommunication.java:149)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstCommunication.execute(VstCommunication.java:144)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.velocystream.VstProtocol.execute(VstProtocol.java:46)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.ArangoExecutorSync.execute(ArangoExecutorSync.java:71)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.ArangoExecutorSync.execute(ArangoExecutorSync.java:57)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.internal.ArangoDatabaseImpl.query(ArangoDatabaseImpl.java:171)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.springframework.core.template.ArangoTemplate.query(ArangoTemplate.java:358)
Dec 01 01:24:31 sudo[1402]: at com.arangodb.springframework.repository.query.AbstractArangoQuery.execute(AbstractArangoQuery.java:83)
Dec 01 01:24:31 sudo[1402]: at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor$QueryMethodInvoker.invoke(QueryExecutorMethodInterceptor.java:195)
Dec 01 01:24:31 sudo[1402]: at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:152)
Dec 01 01:24:31 sudo[1402]: at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:130)
Dec 01 01:24:31 sudo[1402]: at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
Dec 01 01:24:31 sudo[1402]: at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:95)
Dec 01 01:24:31 sudo[1402]: at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
Dec 01 01:24:31 sudo[1402]: at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212)

There are a few things worth checking.
If the database is not under heavy load and is running on a fast system, you may need to increase the --server.background-sync-wait-threshold setting in the database configuration (be careful with this).
This setting determines the maximum amount of time that the database will wait for the background settings sync to complete before issuing a warning. Increasing this value lets the database wait longer for the sync to complete, but may result in slower performance.
Also, check whether the database is running on a slow system by looking at the system specifications, such as CPU and disk speed. If the system is slow, you may need to upgrade to a faster one or use a database engine that is optimized for slow systems.
Heavy load on the DB server can also trigger the warning.
You might also check GitHub issue #15080.
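To see whether the slow syncs line up with the load peaks, it can help to pull the durations out of the log. The following is only a sketch in Python: it assumes the warning lines look exactly like the ones quoted in the question, and the function name slow_sync_durations is made up for illustration.

```python
import re

# Matches the warning lines in the format quoted in the question.
SLOW_SYNC = re.compile(
    r"^(?P<ts>\S+) \[\d+\] WARNING \[3ad54\] \{engines\} "
    r"slow background settings sync: (?P<secs>\d+(?:\.\d+)?) s$"
)

def slow_sync_durations(lines):
    """Return (timestamp, seconds) pairs for every slow-sync warning."""
    found = []
    for line in lines:
        match = SLOW_SYNC.match(line.strip())
        if match:
            found.append((match.group("ts"), float(match.group("secs"))))
    return found

# Sample lines copied from the log excerpt in the question.
sample = [
    "2022-11-24T14:45:35Z [1303] WARNING [3ad54] {engines} slow background settings sync: 9.240683 s",
    "2022-11-24T14:45:49Z [1303] WARNING [3ad54] {engines} slow background settings sync: 11.222022 s",
    "2022-11-24T14:46:05Z [1303] WARNING [3ad54] {engines} slow background settings sync: 14.198186 s",
    "2022-11-24T14:46:18Z [1303] WARNING [3ad54] {engines} slow background settings sync: 10.272200 s",
    "2022-11-24T14:46:34Z [1303] WARNING [3ad54] {engines} slow background settings sync: 13.703265 s",
    "2022-11-24T14:46:35Z [1303] INFO [99d80] {general} Running compaction task...",
]

durations = slow_sync_durations(sample)
worst = max(seconds for _, seconds in durations)
print(f"{len(durations)} slow syncs, worst {worst} s")
```

Running this over a full day of logs and comparing the timestamps with request volume or disk metrics should show whether the warnings follow load or something else, such as the compaction task.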

Related

SQL Server commit and working set

There's a SQL Server instance, MSSQLSERVER, running on localhost on Windows 7. I noticed that its commit is much larger than its working set. Here’s a comparison between my local instance and another instance, MS_MSBI_SSDS, running on Windows Server 2008 R2.
Local SQL Server
Image         PID   Hard Faults/sec  Commit (KB)  Working Set (KB)  Sharable (KB)
sqlservr.exe  2380  0                45,615,948   61,992            17,784
Remote SQL Server
Image         PID   Hard Faults/sec  Commit (KB)  Working Set (KB)  Sharable (KB)  Private (KB)
sqlservr.exe  1964  1                6,464,988    5,496,884         40,608         5,456,636
The large commit makes the local machine almost unusable. The commit charge is at 100% once MSSQLSERVER is launched. Please note that no particular workload is running on the local SQL Server, and it has only 2 databases (8 GB), copied from the remote one.
My questions are:
Why does the local instance have a large commit when it has only a small working set?
Can I find out what has actually been committed?
How can I decrease its commit charge?
Might the problem come from McAfee? I don't have the right to modify it due to company policy. What can I do? Here's a related post: SQLSERVR.EXE High Commit Usage causing a low virtual memory condition.

Windows Server 2012 with ASP.NET MVC application stops working (ESENT errors)

The solution is an ASP.NET MVC application using Entity Framework, hosted in IIS on a Windows Server 2012 R2 Standard VM in a Hyper-V environment. The same VM is running SQL Server 2012.
The hosting environment hosts 30 other solutions. There is plenty of free disk space and there are no known disk problems with the hosting environment or the VM (chkdsk and sfc have been run on the VM and did not report any problems).
The problem is that the solution/server stops working for short periods of 5-10 minutes, and every time we see event ID 508/533 from ESENT and a message about writing to "C:\Windows\system32\LogFiles\Sum".
A similar message has been seen with sqlsvr but this was solved by giving everyone all rights to C:\Windows\system32\LogFiles\Sum.
While the problem persists, it affects the whole VM and sometimes it is not even possible to connect via remote desktop.
We have seen a high number of open SQL Server connections when the problem occurs, and prior to introducing caching for a specific Web API method we were actually able to empty the SQL Server connection pool. Just in case, we have increased the connection pool from 100 to 200 connections, even though we have not seen this particular problem since we introduced the cache.
All DbContext instances are disposed by "using", an ApiController.Dispose override or a Controller.Dispose override, and only one SqlConnection is used (for the logging system).
I suspect the problem lies outside the solution and that the high number of SQL Server connections is related to the fact that SQL Server is unable to write to the disk.
Below are some Windows Event Log excerpts for three recent "breakdowns", with some additional info about the number of web requests prior to the problem and after the server has automatically recovered.
Any suggestions?
web requests during the 10 minutes right before the problem: 1399
web requests during the first 10 minutes after the server has recovered: 1630
18-03-2015 20:07:20 833 MSSQLSERVER
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\Xxx.mdf] in database [Xxx] (5). The OS file handle is 0x0000000000000A7C. The offset of the latest long I/O is: 0x000003e104e000
18-03-2015 20:07:40 833 MSSQLSERVER
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\Xxx_log.ldf] in database [Xxx] (5). The OS file handle is 0x0000000000000A8C. The offset of the latest long I/O is: 0x0000007f203000
18-03-2015 20:08:16 533 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 1806336 (0x00000000001b9000) for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
18-03-2015 20:17:14 508 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 1806336 (0x00000000001b9000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (36 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
web requests during the 10 minutes right before the problem: 696
web requests during the first 10 minutes after the server has recovered: 614
19-03-2015 01:17:19 533 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 3067904 (0x00000000002ed000) for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
19-03-2015 01:33:02 508 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 3067904 (0x00000000002ed000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (983 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
19-03-2015 01:33:03 833 MSSQLSERVER
SQL Server has encountered 5 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\Xxx_log.ldf] in database [Xxx] (5). The OS file handle is 0x0000000000000A8C. The offset of the latest long I/O is: 0x000000a389d000
web requests during the 10 minutes right before the problem: 555
web requests during the first 10 minutes after the server has recovered: 784
19-03-2015 03:33:51 833 MSSQLSERVER
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\Xxx_log.ldf] in database [Xxx] (5). The OS file handle is 0x0000000000000A8C. The offset of the latest long I/O is: 0x000000aa95f000
19-03-2015 03:40:48 533 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 3846144 (0x00000000003ab000) for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
19-03-2015 03:40:48 833 MSSQLSERVER
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\DATA\MSDBLog.ldf] in database [msdb] (4). The OS file handle is 0x0000000000000A90. The offset of the latest long I/O is: 0x00000000108000
19-03-2015 03:40:49 508 ESENT
svchost (1740) A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log" at offset 3846144 (0x00000000003ab000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (36 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
19-03-2015 03:40:49 17894 MSSQLSERVER
Dispatcher (0x1a88) from dispatcher pool 'XE Engine main dispatcher pool' Worker 0x00000000F03B8160 appears to be non-yielding on Node 0. Approx CPU Used: kernel 0 ms, user 0 ms, Interval: 336140.
Disk I/O problems were my initial thought, but the "funny" thing is that it has never actually happened during peak hours, and during peak hours the server is not stressed on CPU or disk I/O.
I cannot find any VM disk errors. I have no access to the hosting environment but I am told that there are no disk problems. The hosting environment is performing VM backups and if this is the problem, there is nothing to do about it, as it is required. I might try to have the VM moved to another disk but I do not know if this is possible.
Currently we have set up some detailed disk I/O monitoring on the VM and hopefully this will give us some information about the problem but I rather doubt it.
Maybe the VM is just "sick" and the next step might be to create a new one from scratch…
It sounds like your disk is just plain overloaded, since I/Os are taking so long. Ideally they should take around 10 milliseconds. Instead, they're taking over 1000x that long.
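To put numbers on that, here is a rough back-of-the-envelope calculation in Python. The 10 ms baseline is the ballpark figure above, not a measured value, and the durations are the ones reported in the event log excerpts in the question.

```python
# Ballpark baseline for a healthy disk I/O, in milliseconds (an assumption).
BASELINE_MS = 10

# Durations from the event log excerpts in the question, in milliseconds.
observed_ms = {
    "SQL Server 833 (longer than 15 s)": 15_000,
    "ESENT 533/508 (36 s)": 36_000,
    "ESENT 508 worst case (983 s)": 983_000,
}

# How many times slower than the baseline each I/O was.
ratios = {name: ms / BASELINE_MS for name, ms in observed_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:,.0f}x the baseline")
```

Even the mildest case here, the 15-second threshold that triggers the SQL Server warning, is already 1500 times the baseline.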
Since you're running in a VM, though, tracking down the problem can be a bit more tricky. Is it due to the I/O load in the virtual machine, or on the host? Your VM disk may be shared with other I/O load of the host.
Can you move the database to a different volume in the VM, hosted on a different physical spindle of the host?
Another possibility is that the underlying storage is going bad, and the I/Os are being retried by the underlying hardware.
-martin

Apache Proxy Plugin handling of JVM ID in JSESSION Cookie

I am trying to understand the mapping between the JVMID present in the JSESSION cookie and the ipaddr:port of the managed server. A few questions:
Who generates the JVMID, and how does the Apache plugin know the JVMID of a given node? Does it get it back in the response from the server (maybe as part of the Dynamic Server List)?
If we send a request with a JSESSION cookie containing a JVMID to an Apache instance that hasn’t handled any requests yet, what would the behavior be?
Assuming that Apache maintains a local mapping between JVMIDs and node addresses, how does this get updated (especially after an Apache restart or a managed server restart)?
See more at: http://middlewaremagic.com/weblogic/?p=654#comment-9054
1) The JVM ID is generated by each WebLogic server and appended to the JSESSIONID.
Apache logs each server's hash and maps it to the respective managed server, so it can send a request to the same WebLogic managed server that handled the previous one.
Here is an Example log from http://www.bea-weblogic.com/weblogic-server-support-pattern-common-diagnostic-process-for-proxy-plug-in-problems.html
Mon May 10 13:14:40 2004 getpreferredServersFromCookie: -2032354160!-457294087
Mon May 10 13:14:40 2004 GET Primary JVMID1: -2032354160
Mon May 10 13:14:40 2004 GET Secondary JVMID2: -457294087
Mon May 10 13:14:40 2004 [Found Primary]: 172.18.137.50:38625:65535
Mon May 10 13:14:40 2004 list[0].jvmid: -2032354160
Mon May 10 13:14:40 2004 secondary str: -457294087
Mon May 10 13:14:40 2004 list[1].jvmid: -457294087
Mon May 10 13:14:40 2004 secondary str: -457294087
Mon May 10 13:14:40 2004 [Found Secondary]: 172.18.137.54:38625:65535
Mon May 10 13:14:40 2004 Found 2 servers
2) If the plugin is installed on the new Apache as well, then as soon as Apache starts up it pings all available WebLogic servers to mark them as Live or Dead (my terms, not official). During that health check it gets the JVMID of each available WebLogic server. After that, when it receives the first request with a pre-existing JVMID, it can route it correctly.
3) There are parameters such as DynamicServerList. If it is ON, the plugin keeps polling for healthy WebLogic servers; if it is OFF, it sends requests only to the hardcoded list. So with it ON, the mapping is pretty dynamic.
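The lookup shown in the log above can be sketched roughly like this in Python. This is not the plugin's actual code, only an illustration of the idea: it assumes the cookie value has the form session!primary!secondary (as the getpreferredServersFromCookie line suggests), the session ID "SESSIONID" is a placeholder, and the function names are made up.

```python
def parse_jvmids(cookie_value):
    """Split 'session!primary!secondary' into (session_id, primary, secondary).

    Missing or non-numeric JVMIDs are returned as None.
    """
    session_id, *hashes = cookie_value.split("!")
    ids = []
    for raw in hashes[:2]:
        try:
            ids.append(int(raw))
        except ValueError:
            ids.append(None)
    while len(ids) < 2:
        ids.append(None)
    return session_id, ids[0], ids[1]

def pick_server(cookie_value, known_servers):
    """Prefer the primary JVMID, then the secondary, else return None."""
    _, primary, secondary = parse_jvmids(cookie_value)
    for jvmid in (primary, secondary):
        if jvmid in known_servers:
            return known_servers[jvmid]
    return None  # no preferred server known; any live server may be used

# JVMID to host:port mapping taken from the example log above.
known = {
    -2032354160: "172.18.137.50:38625",
    -457294087: "172.18.137.54:38625",
}

cookie = "SESSIONID!-2032354160!-457294087"  # "SESSIONID" is a placeholder
print(pick_server(cookie, known))
```

With an empty mapping, which is the situation of a freshly started Apache before it has learned any JVMIDs, pick_server returns None, which corresponds to the case asked about in the second question.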

NTP Configuration without Internet

I am trying to set up a local NTP server without an Internet connection.
Below is my ntp.conf on Server
# Server
server 127.127.1.0
fudge 127.127.1.0 stratum 5
broadcast 10.108.190.255
Below is my ntp.conf on Clients
# Clients
server 10.108.190.14
broadcastclient
But my clients are not syncing with the server. The output of ntpq -p on the clients shows that they are not taking time from the server, and the server IP is shown at stratum 16.
Could anyone please help with this issue?
The server should use its local clock as the source. A better setup for isolated networks is orphan mode, which gives you fail-over. Check out the documentation:
http://www.eecis.udel.edu/~mills/ntp/html/orphan.html
You need to configure the clients with the prefer keyword. ntpd tries its hardest not to honor local undisciplined clocks in order to prevent screwups.
server 10.108.190.14 prefer
For more information see: http://www.ntp.org/ntpfaq/NTP-s-config-adv.htm#AEN3658
This all assumes that you have included the entire ntp.conf and did not leave out any restrict lines.
How about using chrony?
Steps:
Install chrony on both devices
sudo apt install chrony
Assuming the server IP address is 192.168.1.87, the client configuration (/etc/chrony/chrony.conf) is as follows:
server 192.168.1.87 iburst
keyfile /etc/chrony/chrony.keys
driftfile /var/lib/chrony/chrony.drift
log tracking measurements statistics
logdir /var/log/chrony
Server configuration (/etc/chrony/chrony.conf), assume your client IP is 192.168.1.14
keyfile /etc/chrony/chrony.keys
driftfile /var/lib/chrony/chrony.drift
log tracking measurements statistics
logdir /var/log/chrony
local stratum 8
manual
allow 192.0.0.0/24
allow 192.168.1.14
Restart chrony on both computers
sudo systemctl stop chrony
sudo systemctl start chrony
Check on the client side:
sudo systemctl status chrony
Output:
Jun 24 13:26:42 op-desktop systemd[1]: Starting chrony, an NTP client/server...
Jun 24 13:26:42 op-desktop chronyd[9420]: chronyd version 3.2 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SECHASH +SIGND +ASYNCDNS +IPV6 -DEBUG)
Jun 24 13:26:42 op-desktop chronyd[9420]: Frequency -6.446 +/- 1.678 ppm read from /var/lib/chrony/chrony.drift
Jun 24 13:26:43 op-desktop systemd[1]: Started chrony, an NTP client/server.
Jun 24 13:26:49 op-desktop chronyd[9420]: Selected source 192.168.1.87
chronyc tracking output:
Reference ID : C0A80157 (192.168.1.87)
Stratum : 9
Ref time (UTC) : Thu Jun 24 10:50:34 2021
System time : 0.000002018 seconds slow of NTP time
Last offset : -0.000000115 seconds
RMS offset : 0.017948076 seconds
Frequency : 5.491 ppm slow
Residual freq : +0.000 ppm
Skew : 0.726 ppm
Root delay : 0.002031475 seconds
Root dispersion : 0.000664742 seconds
Update interval : 65.2 seconds
Leap status : Normal
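A quick way to confirm that the client is really tracking your server is to decode the Reference ID. For an IPv4 source it is simply the address written in hexadecimal, which is why C0A80157 appears next to 192.168.1.87 in the output above. A small Python sketch (the helper name refid_to_ipv4 is made up, and this only applies when the source is an IPv4 address):

```python
import ipaddress

def refid_to_ipv4(refid_hex):
    """Decode a chronyc 'Reference ID' hex string into a dotted IPv4 address.

    Only meaningful when the selected source is an IPv4 server.
    """
    return str(ipaddress.IPv4Address(int(refid_hex, 16)))

print(refid_to_ipv4("C0A80157"))  # 192.168.1.87
```

The stratum line tells the same story: the server is configured with "local stratum 8", and the client reports stratum 9, one step below its source.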

Solr excessive logging and autowarming - is it normal?

I am pretty new to Solr, so I apologize if this is a stupid question :)
I have a Solr process running and logging to a file. The log level is set to INFO, I believe. Even so, it logs like crazy although almost nothing is being searched. The logs mostly contain records like these:
INFO: autowarming result for Searcher#7c35a3be main
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
May 31, 2012 6:53:45 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher#7c35a3be main from Searcher#7dde0950 main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=6188938,cumulative_hits=2441,cumulative_hitratio=0.00,cumulative_inserts=6186497,cumulative_evictions=4581707}
May 31, 2012 6:53:45 PM org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher#7c35a3be main
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=6188938,cumulative_hits=2441,cumulative_hitratio=0.00,cumulative_inserts=6186497,cumulative_evictions=4581707}
May 31, 2012 6:53:45 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher#7c35a3be main
May 31, 2012 6:53:45 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
May 31, 2012 6:53:45 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [] Registered new searcher Searcher#7c35a3be main
May 31, 2012 6:53:45 PM org.apache.solr.search.SolrIndexSearcher close
INFO: Closing Searcher#7dde0950 main
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=6188938,cumulative_hits=2441,cumulative_hitratio=0.00,cumulative_inserts=6186497,cumulative_evictions=4581707}
May 31, 2012 6:53:45 PM org.apache.solr.update.processor.LogUpdateProcessor finish
Is this normal? It seems to put a pretty hefty load on the system (nothing too dramatic, but still).
I am just trying to understand what exactly it is doing and why.
IMHO, in a production environment you should use the WARNING level until you have a problem (as application servers do).
You can configure logging through the Solr admin console (for a local Jetty the URL would be http://localhost:8983/solr/admin/logging), and it can be set for every package/class separately.
Logging levels are:
FINEST
FINE
CONFIG
INFO
WARNING
SEVERE
OFF
If you leave it unset, INFO is used.
By default the logging level is info.
When Solr loads a core, it loads all the configuration files for that core and autowarms the caches, and all of this is written to the log files.
Configuring logging lets you set the logging to the level you need.
You can configure the autowarming for Solr in the solrconfig.xml configuration file.

Resources