We use the rsyslogd daemon for logging debug messages in several applications. With full debug on, the server will hang in the openlog() call that the application makes to log messages. It appears to be blocked taking a lock in the glibc library code.
Is this a known issue under load or are we exposing some contention issue in the kernel? RHEL 6.4 running kernel 2.6.32-358.2.1 (i386).
The rsyslog version was 8.4.1, but we upgraded to the latest (8.17) this afternoon and are still running tests to see if the problem continues.
I'm running a DPDK process on Linux and tried to follow the example below to analyze core efficiency.
https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/core-utilization-in-dpdk-apps.html#core-utilization-in-dpdk-apps_RX
I ran the program on Ubuntu 20.04 and copied the result file back to my Windows laptop because the Ubuntu machine lacks GUI components. However, I can't get the DPDK Rx Spin Time shown in the example.
vtune results
warning logs
Is it because of the warning logs, or am I missing something? Any help would be appreciated.
Please check that the application you are profiling contains the necessary callbacks for RX and TX tracing.
To enable RX burst statistic collection in VTune Profiler you need:
build DPDK with the CONFIG_RTE_ETHDEV_RXTX_CALLBACKS and CONFIG_RTE_ETHDEV_PROFILE_WITH_VTUNE options enabled
rebuild your test application with this DPDK
run VTune I/O analysis
To check that VTune profiling is enabled, run:
nm <test_app> | grep profile_hook_rx_burst_cb
I am using a TI OMAP/AM3517 based custom board which has an ARM7 core and runs Linux 2.6.32.
Recently, I started observing an issue in which all the interfaces (serial console, RNDIS/USB, network) stop responding: keystrokes typed on the keyboard connected to the serial/console port don't go through, no device is detected upon removal/insertion of USB sticks or a BT dongle, and Ethernet/network activity stops (I can't log in over ssh or access the web page that the system serves).
I checked the power to the CPU and other power rails using DMM and it looks fine.
There is no keyboard attached so MagicSysRq cannot be used while the issue happens.
I have a watchdog on the system which rebooted the system; however, to reproduce the issue and get more information about the system's state, I have now disabled the watchdog.
Finally, my system is configured to reboot upon detecting a soft lockup, which is not happening either. Also, I have enabled verbose debug options under kernel hacking, but nothing seems to give any more information.
How do I debug the issue to identify its root cause?
At some point my site, running on Apache2 with mod_wsgi, just stops processing requests. The connection to the server is maintained and the client waits for a response, but Apache never returns one. The server at this time is at 0% CPU, and nothing is processing. I think Apache just puts requests in a queue and never takes them out.
Running apache2ctl graceful does not resolve the problem; only apache2ctl restart does.
My site consists of 4 instances of a Pyramid WSGI application and 2 instances of Zope 3. It runs normally and has no speed problems that I am aware of.
versions:
Ubuntu 10.04
apache2 2.2.14-5ubuntu8.9
libapache2-mod-wsgi 2.8-2ubuntu1
Sounds like you are using embedded mode to run the multiple applications and you are using third party C extensions that have problems in sub interpreters, resulting in potential deadlock. Else your code is internally deadlocking or blocking on external services and never returning, causing exhaustion of available processes/threads.
For a start, you should look at using daemon mode and delegate each web application to a distinct daemon process group and then forcing each to run in the main interpreter.
See:
http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide#Delegation_To_Daemon_Process
http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Python_Simplified_GIL_State_API
Otherwise use debugging tips described in:
http://code.google.com/p/modwsgi/wiki/DebuggingTechniques
for getting stack traces about what application is doing.
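A common way to get such stack traces from inside a running Python process is sys._current_frames(). The following is only a minimal sketch, written for Python 3 (the setup in the question would be Python 2); the function name format_thread_stacks is hypothetical, and how you trigger it (a signal, a special URL, a timer) is up to you.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text report with the current stack of every Python thread."""
    # Map thread identifiers to names so the report is easier to read.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (thread_id, names.get(thread_id, "unknown")))
        # format_stack returns one string per frame, oldest call first.
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_thread_stacks())
```

If every request thread shows up blocked in the same call, that points at the deadlock or the external service the application is waiting on.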
I have a system running embedded Linux and it is critical that it runs continuously. Basically it is a process for communicating with sensors and relaying that data to a database and web client.
If a crash occurs, how do I restart the application automatically?
Also, there are several threads doing polling (e.g. sockets and UART communications). How do I ensure none of the threads get hung up or exit unexpectedly? Is there an easy-to-use watchdog that is threading friendly?
You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
Which leaves only the problem of detecting a hung process. You can use any of the solutions pointed out by Michael Aaron Safyan for this, but a yet easier solution would be to use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire.
That way, no extra programs needed, and only portable POSIX stuff used.
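The two ideas above (restart with fork and waitpid, detect hangs with alarm) can be sketched together. This is an illustration in Python rather than C, using os.fork, os.waitpid and signal.alarm, which wrap the same POSIX calls; the names worker_loop and supervise, the hang_at parameter and the one-second timeout are assumptions made for the example.

```python
import os
import signal
import time


def worker_loop(iterations, hang_at=None, timeout=1):
    """Do `iterations` units of work, re-arming the alarm before each one."""
    # The default disposition of SIGALRM terminates the process.
    signal.signal(signal.SIGALRM, signal.SIG_DFL)
    for i in range(iterations):
        signal.alarm(timeout)        # re-arm the watchdog: "still alive"
        if i == hang_at:
            time.sleep(timeout + 5)  # simulated hang: the alarm fires first
        time.sleep(0.01)             # stand-in for real work
    signal.alarm(0)                  # cancel the alarm before returning


def supervise(run_child, max_restarts):
    """Fork a child and restart it until it exits with status 0.

    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        pid = os.fork()
        if pid == 0:
            # Child: run the worker, then leave without returning to the caller.
            exit_code = 1
            try:
                run_child(restarts)
                exit_code = 0
            finally:
                os._exit(exit_code)
        # Parent: block until the child exits or is killed by a signal.
        _, status = os.waitpid(pid, 0)
        clean = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        if clean or restarts >= max_restarts:
            return restarts
        restarts += 1
```

For example, supervise(lambda attempt: worker_loop(3, hang_at=0 if attempt == 0 else None), max_restarts=3) runs a worker that hangs on its first attempt, is killed by SIGALRM, and is then restarted. Note that the alarm is process-wide, so in a multi-threaded program it only shows that whichever thread calls alarm is still making progress.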
The gist of it is:
1. You need to detect if the program is still running and not hung.
2. You need to (re)start the program if the program is not running or is hung.
There are a number of different ways to do #1, but two that come to mind are:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look at the timestamp for the file, and if it is stale, then it can assume that the application is dead or deadlocked.
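The file-based variant might look roughly like the sketch below. The function names touch_heartbeat and is_stale, and the idea of passing the maximum age in seconds, are assumptions for illustration; the monitored application would call the first function from its main loop, and the external monitor would call the second.

```python
import os
import tempfile
import time


def touch_heartbeat(path):
    """Called periodically by the monitored application."""
    # Create the file if needed, then set its modification time to now.
    with open(path, "a"):
        os.utime(path, None)


def is_stale(path, max_age_seconds, now=None):
    """Called by the external monitor; True if the heartbeat is missing or old."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # no heartbeat file at all counts as dead
    return (now - mtime) > max_age_seconds


if __name__ == "__main__":
    heartbeat = os.path.join(tempfile.mkdtemp(), "heartbeat")
    touch_heartbeat(heartbeat)
    print("stale:", is_stale(heartbeat, max_age_seconds=10))
```

The maximum age should be comfortably larger than the interval at which the application touches the file, otherwise a briefly busy application gets restarted for no reason.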
With respect to #2, killing the previous PID and using fork+exec to launch a new process is typical. You might also consider making your application that runs "continuously", into an application that runs once, but then use "cron" or some other application to continuously rerun that single-run application.
Unfortunately, watchdog timers and getting out of deadlock are non-trivial issues. I don't know of any generic way to do it, and the few that I've seen are pretty ugly and not 100% bug-free. However, tsan (ThreadSanitizer) can help detect potential deadlock scenarios and other threading issues by instrumenting the program and analyzing it at run time.
You could create a cron job to check from time to time whether the process is running, using start-stop-daemon.
Use this script to run your application:
#!/bin/bash
while ! /path/to/program   # Loop ends once the program exits successfully (status 0).
do
    echo "restarting"      # Non-zero exit status: log it and start the program again.
    sleep 1                # Brief pause to avoid a tight restart loop.
done
You can also put this script in /etc/init.d/ in order to start it as a daemon.
One of our customers has a newish problem: the application grinds to a halt. A thread dump shows that all threads are hanging on network IO in JDBC calls.
We/I have never seen these 'network IO' hangs. Typically a slow machine w/ DB problems has either a) one or two long-running queries or b) some type of lock/deadlock. In either of these cases the threads 'hang' on different methods. I have never seen all 30+ threads hanging on network IO.
Below I have included an excerpt from the thread dump. All HTTP threads are hanging on the same java.net.SocketInputStream.read call.
I talked to their dba and sysadmin. According to them 'nothing has changed' in the environment recently which would cause this problem.
db environment
MSSQL 2005 64-bit Service Pack 2
Driver: sqljdbc.jar : 1.0 809 102
Note: they are running an older jdbc driver. AFAIK they tried upgrading from 1.0 to the 1.2 driver but had some other problem.
other environment issues
They're running both the app server and the db server in VMware VMs. I don't know how this setup affects network performance.
Apparently this is the only application with this problem. I don't know anything else about their network architecture.
Questions
* any insights on this problem?
* if it is network, any next steps for analyzing?
Appendix A: Excerpt from Thread dump
All HTTP connections are hanging on the same method:
"TP-Processor31" daemon prio=5 tid=0x04085b78 nid=0x970 runnable [0x0764d000..0x0764fd6c]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.microsoft.sqlserver.jdbc.DBComms.receive(Unknown Source)
at com.microsoft.sqlserver.jdbc.IOBuffer.sendCommand(Unknown Source)
- locked (a com.microsoft.sqlserver.jdbc.DBComms)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.sendExecute(Unknown Source)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteQuery(Unknown Source)
We've had similar issues, and traced them back to a buggy JDK update (1.6.29).
We downloaded 1.6.27 (http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u27-oth-JPR), re-set the JAVA_HOME environment and we were back on track.
The cause is the changes to JSSE in version 1.6 u29, which partially patch CVE-2011-3389 and CVE-2011-3560.
If you are not deploying applets or using Java Web Start, you might be able to just use the 1.6 u27 jsse.jar. You're still going to have the vulnerability, but it will allow the sqljdbc.jar and sqljdbc4.jar to work.
The other option is to migrate to Java 7; the sqljdbc4.jar does work in that environment.
The patch is not a complete fix, so expect more changes in future patches.