Hiredis timeout with large number of concurrent requests - c

I'm using a redis integration plugin for Varnish called libvmod-redis. I'm seeing an issue where if I get a large number of concurrent requests, around 350, redis starts timing out and I eventually get a segfault in Varnish.
I get these errors:
varnishd[27892]: Child (27893) said redis error (connect): Connection timed out
varnishd[27892]: Child (27893) said redis error (command): err=1 errstr=Connection timed out
varnishd[19528]: Child (19529) said redis error (command): err=1 errstr=select(2): Invalid argument
varnishd[19528]: Child (19529) said redis error (command): err=1 errstr=Connection timed out
varnishd[19528]: last message repeated 9 times
varnishd[19528]: Child (19529) said redis error (command): err=1 errstr=select(2): Invalid argument
varnishd[19528]: Child (19529) said redis error (connect): fcntl(F_GETFL): Bad file descriptor
varnishd[19528]: Child (19529) said redis error (command): err=1 errstr=fcntl(F_GETFL): Bad file descriptor
kernel: [282284.005658] varnishd[19727] general protection ip:7f1f9dea1427 sp:7f1f4123c120 error:0 in libhiredis.so.0.10[7f1f9de9f000+9000]
My timeout is 1 second, and I'm using an ElastiCache node for Redis. I'm wondering what exactly could be failing here. I'm not an expert in C, so I feel like I'm missing something.

It would be interesting to see the code, especially around the select.
Since you have a lot of concurrent requests you better check the validity of your file descriptors: maybe on of them is closed during the process...

You may try the following:
To start, try with redis-cli --latency if the problem is the Redis server. If redis-cli has no latency issues from its point of view, the problem is the client.
If the problem is the server, you may want to follow this guide: http://redis.io/topics/latency

Related

DBus : Can't get match rules for my user's session bus

I'm trying to use dbus/tools/GetAllMatchRules.py to get diagnostic information. When I run it without parameters as my regular user I get "GetConnectionMatchRules failed: did you enable the Stats interface?"
I modified GetAllMatchRules to print the specific exception details. It now says
GetConnectionMatchRules failed: did you enable the Stats interface?: org.freedesktop.DBus.Error.AccessDenied: The caller does not have the necessary privileged to call this method
So then I'm wondering, does it work at all? So I sudo su and run it again and it gives me the kind of information I'd expect to see, just not for the right bus. Oddly, if I use the --system parameter, even root gets org.freedesktop.DBus.Error.AccessDenied .
The repository claims, in bus/example-session-disable-stats.conf.in , that
"If the Stats interface was enabled at compile-time, users can use it on
the session bus by default. Systems providing isolation of processes
with LSMs might want to restrict this. This can be achieved by copying
this file in #EXPANDED_SYSCONFDIR#/dbus-1/session.d/
"
But that's clearly not the case because my user can NOT access this information.
I even tried a brute force approach to disabling (commenting out) ALL deny statements at /usr/share/dbus-1/system.conf and reloading and it still doesn't work. I also tried a full system restart in case I wasn't reloading correctly. I also did a system-wide search for system.conf in case it's actually using some other conf file that I'm not seeing, which would mean I'm modifying the wrong thing. I got a big hint that that's not the case when I had a typo (-- instead of --> for commenting out) and it failed to reload, but did reload once I fixed the typo.
I'm ok with the possibility that I can only do this signed in as root, so I also tried modifying GetAllMatchRules to use dbus.bus.BusConnection(), and force-feeding it the session address (unix:path=/run/user/1000/bus) which results in
"org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."
Incidentally, this is the same issue that happens if I leave the code alone but use sudo -E su instead of just sudo su (the -E option in this case means that the $DBUS_SESSION_BUS_ADDRESS variable is retained)
I'm not sure what to try next...
Turns out there isn't currently a solution, the privilege error is simply the code that was chosen to indicate that the method is an unimplemented stub method

Error on iOS14 when loading OBJs into MDLAsset

When loading OBJs into an MDLAsset using the MDLAsset(url:) initializer (to eventually get the model into SceneKit), the operation fails frequently and inconsistently on iOS14. This operation works fine for these same files on previous iOS versions. I've also observed the bug on iPadOS, although maybe less frequently. Not sure if it's relevant, but these OBJs are pulled from server and stored locally. But this bug is occurring after files are already downloaded. Sometimes the same file will fail multiple times before randomly working, and vice versa.
The console output seems to indicate a failure to communicate with ModelIO XPC service. I tried restarting my device, but the bug continues to occur. Console output:
connection to com.apple.ModelIO.AssetLoader was interrupted
AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 "connection to service on pid 0 named com.apple.ModelIO.AssetLoader" UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
Couldn’t communicate with a helper application.
connection to com.apple.ModelIO.AssetLoader was interrupted
Has anyone else run into this issue on iOS14?
Alternatively, are there any workarounds anyone has tried in the meantime? As far as I know, loading an OBJ (that is downloaded from server) into SceneKit can only be done through ModelIO, without writing an OBJ parser myself.
This seems to be fixed in 14.3.
2020-10-13 18:31:36.989282+0300 Studia3D Viewer[1452:348335] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.989368+0300 Studia3D Viewer[1452:347676] AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 “connection to service on pid 0 named com.apple.ModelIO.AssetLoader” UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
2020-10-13 18:31:36.989404+0300 Studia3D Viewer[1452:348332] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.997352+0300 Studia3D Viewer[1452:347676] Не удалось установить связь с приложением-помощником.
The same thing happens with local files
No solution yet

How to handle exit codes with Camel Exec?

I'm using Camel Exec for automated shutdowns on some of our devices.
The shutdown command is pretty simple, and it mostly works fine:
from(START_DEEP_SLEEP)
.setBody(constant(null)) // we don't want stdin for exec
.setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
.to("exec:sudo")
Obviously, this command will send a shutdown to the application executing it. That too isn't much of an issue, except that sometimes this produces an exit value of 143. I know the meaning of the return value, and it makes sense to see it here, but this only happens on some devices. Most others just return 0. They are all of the same type, so I really don't know where this discrepancy comes from, but it's not even that big an issue. The shutdown works none the less.
The problem is that camel exec logs this as an error:
ERROR 549 --- [Camel (camel-1) thread #1 - seda://start-deepsleep] o.a.camel.component.exec.ExecProducer : The command ExecCommand [args=[shutdown, now, starting deep sleep shutdown], executable=sudo, timeout=9223372036854775807, outFile=null, workingDir=null, useStderrOnEmptyStdout=false] returned exit value 143
This produces undesired noise in our monitoring, and I would rather not have it logged.
The core issue here is that Camel Exec does not throw, so there's no exception I could handle. It just logs the error, which then gets picked up by our log analysis.
I would like to handle that exit code gracefully without camel Exec logging an error. The return value is already logged separately anyways. How can I do that?
According to the docu http://camel.apache.org/exec.html there is a header ExecBinding.EXEC_EXIT_VALUE filled with the error number. Yours should be 143 (the docu states that this depends on the OS).
That could be a "hook" to handle the log entry, e.g. deleting the last entry with the same error number.
Of course this is only a cosmetic fix. The implementation could be like this:
from(START_DEEP_SLEEP)
.setBody(constant(null)) // we don't want stdin for exec
.setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
.to("exec:sudo")
.when(header(ExecBinding.EXEC_EXIT_VALUE))
.to("direct:edit_the_log")
Please note that I did not test that code. Maybe u access that header with
.when(header(EXEC_EXIT_VALUE))
instead.
Please, inform me if that could be a proper solution or not.

MPI number of process

I am running a sample MPI program which prints hello world.
When I am running with 1,2....330 process it runs as expected.
But when the number goes beyond 330 it fails with below error.
Can some explain the reason for this.
I am running the program on my laptop which has i5 processor with 4 cores and 8 GB RAM.
[proxy:0:0#Abhishek-Machine] HYDU_create_process (./utils/launch/launch.c:25): pipe error (Too many open files)
[proxy:0:0#Abhishek-Machine] launch_procs (./pm/pmiserv/pmip_cb.c:705): create process returned error
[proxy:0:0#Abhishek-Machine] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
[proxy:0:0#Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#Abhishek-Machine] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec#Abhishek-Machine] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec#Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#Abhishek-Machine] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec#Abhishek-Machine] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
You are hitting an OS limit for socket descriptors or similar. Over subscribing your workstation to this degree is not a good idea and unlikely to work unless you change your system settings (which is not a good idea for this use case).

ClearCase db_server.exe: Error: db_VISTA error -920 (errno == "Bad file descriptor")

Seeing following errors in the server db log.
db_server.exe(): Error: db_VISTA error -920 (errno == "Bad file descriptor")
db_server.exe(): Error: Cannot open database in ".vbs\db"
Any idea, seeing this error for every 20 mins. This is happening for only two vobs.
Followed below with no luck.
http://www-01.ibm.com/support/docview.wss?uid=swg21236027
http://www-01.ibm.com/support/docview.wss?rs=984&uid=swg21148639
http://www-01.ibm.com/support/docview.wss?uid=swg21133944
The About db_VISTA errors page mentions:
db_VISTA database error -920 - no lock manager is installed
db_VISTA error 2 from OpenFileMapping() of lockmgr_almd
And reference the technote "DB_Vista -920 error and Error 2 from OpenFileMapping()"
Even if this isn't exactly the same error message, check the status of your lock manager (lockmgr.exe on Windows, lockmgr on Unix), both on the client and the server.
Regarding the db_server process, you can try a stop/restart ClearCase on the server, to reset both db_server and vob_server processes.
That can be also related to almd parameters, initially found in:
/opt/rational/clearcase/config/vob/db/vob_almd_params
(for all vobs), but also found in each vobs:
/path/to/vobstorage/yourVob.vbs/db/vob_almd_params
I usually try to raise those parameters in case of 920 errors.
For instance:
-u 4000 -q 16000
That (meaning those local vob configs) would explain why you see errors in only two of your vobs.
To stop the services on Windows: "How do I determine via Windows command line whether ALBD service is running?".

Resources