freeipmi - ipmimonitoring_sensors returning internal ipmi error - ipmi

I am executing the ipmimonitoring-sensors.c example provided in the freeipmi library.
It sometimes throws an internal error. The issue is reproducible when I execute the program back to back a couple of times. I need to wait approximately 30 seconds after the last execution for the program to run properly. Has anyone faced this issue before? If so, can you tell me how to avoid it?
This is the error: ipmi_monitoring_sensor_readings_by_record_id: internal error
Thanks

FreeIPMI maintainer here. The "internal error" indicates some logical error that the library doesn't know how to handle. Given that it's coming from ipmi_monitoring_sensor_readings_by_record_id and it occurs when you run the program back to back, I would bet there is some internal IPMI issue on your system.
Perhaps the motherboard has trouble with a high amount of IPMI traffic, or a sensor has trouble with a high number of requests. Many of these situations are handled more gracefully (perhaps by returning a BUSY error or, minimally, a SYSTEM error), but perhaps there is some combination of error situations I haven't yet seen. (Lots of motherboards return errors that would be considered non-standard or unexpected.)
If you're interested in working through that, just send something onto the FreeIPMI mailing list.
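If the failure really is transient, as the back-to-back pattern suggests, one possible workaround is to retry the read after a pause. The following is a minimal C sketch of retry with exponential backoff under that assumption; it is not something FreeIPMI provides. The names read_fn, read_with_retry, and flaky_read are hypothetical, and flaky_read only stands in for a wrapper around the real ipmi_monitoring_sensor_readings_by_record_id call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical callback type: returns >= 0 on success, negative on failure.
 * In a real program this would wrap the call to
 * ipmi_monitoring_sensor_readings_by_record_id() and its arguments. */
typedef int (*read_fn)(void *ctx);

static void sleep_seconds(unsigned int secs)
{
    struct timespec ts = { .tv_sec = (time_t)secs, .tv_nsec = 0 };
    nanosleep(&ts, NULL);
}

/* Call fn up to max_attempts times, doubling the pause after each failure.
 * Returns the first non-negative result, or the last failure code. */
int read_with_retry(read_fn fn, void *ctx, int max_attempts,
                    unsigned int first_delay_s)
{
    assert(fn != NULL && max_attempts > 0);

    unsigned int delay = first_delay_s;
    int rc = -1;

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        rc = fn(ctx);
        if (rc >= 0)
            return rc;
        fprintf(stderr, "attempt %d of %d failed (rc=%d)\n",
                attempt, max_attempts, rc);
        if (attempt < max_attempts) {
            sleep_seconds(delay);
            delay *= 2;
        }
    }
    return rc;
}

/* Stand-in for the real sensor read: fails while *ctx > 0, then succeeds. */
int flaky_read(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

Since the question reports that waiting about 30 seconds is enough, a call such as read_with_retry(fn, ctx, 4, 5) (pauses of 5, 10, and 20 seconds) would cover that window; the exact numbers are a guess and would need tuning on the affected hardware.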

Set the driver_type = -1 (for default) and it works.

Related

What is the difference between an error and a blocking exception for a target device?

Could you please provide more details about errors and exceptions:
What is the difference between an error and a blocking exception for a target device? Which intent and what kind of response should we use in each case? Could you provide examples?
Should we use ONLY error codes in the EXECUTE response? Are exception codes not available in the EXECUTE response?
Can exception codes be used ONLY in the QUERY response, which provides the status of the target device and all associated devices?
How should we handle a blocking error on the target device if the desired error message appears in the list of exception codes and there is no similar message among the error codes (for example, “inSoftwareUpdate”)? Could you please provide an example?
A couple notes from the documentation on this point:
You should return an error code when an issue causes an execute or query request to fail.
You should return an exception when there is an issue or alert associated with a command.
To help clarify this a bit more, an ERROR generally occurs when you are unable to process the intent (can't reach the device, device is already in the expected state, etc.). An EXCEPTION is typically a related state that doesn't necessarily indicate failure (I was able to lock the door, but FYI the battery is low). This can also be the state of another device when used with the StatusReport trait.
You can return either status where appropriate in response to an intent. See the reference pages for QUERY and EXECUTE intents for more details.

TI AM572x Cortex-A15 CPU core stuck

I have a stability problem with a custom TI AM5728 based board, similar to the BeagleBoard-X15. RTOS software runs on one Cortex-A15 core, MPU0, and sporadically (most often after several hours) freezes. When it freezes, it is impossible to connect to the MPU0 target with the debugger, while I can connect to MPU1 without any problems.
Debugger error:
CortexA15_0: Trouble Halting Target CPU: (Error -1323 # 0x1386AC)
Device failed to enter debug/halt mode because pipeline is stalled.
Power-cycle the board. If error persists, confirm configuration and/or
try more reliable JTAG settings (e.g. lower TCLK). (Emulation package
6.0.504.1)
For test purposes, I started a simple program on MPU1, and when MPU0 freezes, MPU1 continues normal operation. The WFE and WFI flags for MPU0 are inactive. I also ran an additional test in which I put MPU1 into the WFI/FORCED_OFF state; even then, I can still connect to it with the debugger and wake it up from the FORCED_OFF state, as described in the technical manual.
I dumped the registers by connecting to CS_DAP_DebugSS and did not find anything unusual. Register dump attached:
MPU_PRCM_PRM_C0_PM_CPU0_PWRSTCTRL
MPU_PRCM_DEVICE_PRM_RSTST
MPU_WUGEN_WKG_CONTROL_0
MPU_PRCM_CM_C0_CM_CPU0_CLKSTCTRL
What could cause just one core to get stuck so that the debugger cannot connect to it, while the second core keeps running without problems?
Which hardware or software problems could cause such behavior?
Thank you for any suggestions.
I just encountered the exact same problem.
Did you check your code at the address provided with the JTAG error (Error -1323 # 0x1386AC)? In my case it is a GPMC access to an FPGA, which I can still access through CS_DAP_DebugSS.
I'm currently looking at errata i878, from revision L of the document. As it can take more than 48h to hang under stress test, I won't blindly apply the workaround. I'll modify my test, based on i878, trying to increase the failure rate, then I'll apply the workaround.

DB2 Communication Error

We recently developed an application that runs a query in DB2 and sends a mail to the corresponding recipient. It works well on our local system and in the QA region. But in production, a few queries fail (rarely, about once a week) with the exception below.
Exception InnerDetails:
ERROR [40003] [IBM][CLI Driver] SQL30081N A communication error has
been detected. Communication protocol being used: "TCP/IP".
Communication API being used: "SOCKETS". Location where the error was
detected: "111.111.111.111". Communication function detecting the
error: "recv". Protocol specific error code(s): "10004", "", "".
SQLSTATE=08001
Since the error occurs only in production and not very often, we are not sure whether it is a code issue or a settings issue. Do you have any idea?
We recently discussed this issue with our IBM rep. After looking in their internal knowledge base, he suggested we add "Interrupt=0" to our connection string, based on recommendations given to other customers that had the same problem.
The default value for Interrupt was 1 before v10.5 FP2 and still is for most connections. They changed the default value to 2 for connections to z/OS (mainframe) in FP2.
We're using C# and the connection string properties for the IBM Data Server Driver for .Net can be found here. I'm sure there is a similar property for their drivers for other languages.
This page from the IBM docs goes into a bit more detail about the setting.
We haven't seen the issue since we recently added the property, but it was always intermittent so I can't yet confidently say that the problem is fixed. Time will tell...
That particular error (SQL30081N) is just a generic message that indicates a network issue between your DB2 client and the server. In this case, you want to look at the Protocol specific error code(s). Here, it looks like you're on Windows, and that particular code (10004) isn't given in the IBM documentation.
So, if you google "windows network error codes", you'll find this page, which says:
WSAEINTR
10004
Interrupted function call.
A blocking operation was interrupted by a call to WSACancelBlockingCall.
Which links to this page with more information on that specific function (emphasis mine):
The WSACancelBlockingCall function has been removed in compliance
with the Windows Sockets 2 specification, revision 2.2.0.
The function is not exported directly by WS2_32.DLL and Windows
Sockets 2 applications should not use this function. Windows Sockets
1.1 applications that call this function are still supported through the WINSOCK.DLL and WSOCK32.DLL.
Blocking hooks are generally used to keep a single-threaded GUI
application responsive during calls to blocking functions. Instead of
using blocking hooks, an applications should use a separate thread
(separate from the main GUI thread) for network activity.
I'm guessing that your application may be blocking for a longer time in your production application than your other environments, and something along the way is causing the interrupt.
Hopefully this leads you down the right path...
I spent hours on the same problem and fixed it. I use a Windows exe (developed with C#.NET) to run a SELECT query against a DB2 database, and I sometimes got this error. I finally realized that my problem was a timeout. The error with protocol code "10004" sometimes occurs if query execution takes longer than 30 seconds, which is the default timeout value. Perhaps the interruption described on the "Windows Socket Error Codes" page is triggered by the timeout mechanism. I added a line to set an acceptable timeout value and got rid of this annoying error. I hope it helps others.
Here is my code fix:
...
connDb.Open();
DB2Command cmdDb = new DB2Command(QueryText,connDb);
cmdDb.CommandTimeout = 300; //I added this line.
using (DB2DataReader readerDb = cmdDb.ExecuteReader())
{
...

How to detect errors via SoupSession or SoupMessage signals?

I'm currently developing with the WebKitGTK+ unstable API.
I'm using the SoupSession object to connect signals and retrieve SoupMessages, and then (again)
hook signals to every message to obtain timing details of network events. My problem is how to monitor errors from this point.
If I'm using just the signals, is there a way to detect when a network error such as a DNS error or a socket error occurs? I searched the SoupSession manuals but found nothing usable.
Can someone give me some guidance?
Some time ago I figured it out.
The errors are reported in the HTTP response status code of the SoupMessage:
https://developer.gnome.org/libsoup/stable/libsoup-2.4-soup-status.html
I just needed to capture the status code in the SoupMessage "finished" signal to know whether the resource failed (and why) or succeeded.

Delayed Write errors

For the past few months, we've been losing data to Delayed Write errors. I've experienced the error with both custom code and shrink-wrap applications. For example, the error message below came from Visual Studio 2008 while building a solution:
Windows - Delayed Write Failed : Windows was unable
to save all the data for the file
\Vital\Source\Other\OCHSHP\Done07\LHFTInstaller\Release\LHFAI.CAB. The
data has been lost. This error may be caused by a failure of your
computer hardware or network connection. Please try to save this file
elsewhere.
When it occurs in Adobe, Visual Studio, or Word, for example, no harm is done. The major problem is when it occurs in our custom applications (straight C apps that write data in dBase files to a network share).
From the program's perspective, the write succeeds. It deletes the source data, and goes on to the next record. A few minutes later, Windows pops up an error message saying that a delayed write occurred and the data was lost.
My question is: what can we do to help our networking/server teams isolate and correct the problem (read: convince them the problem is real; simply telling them many, many times hasn't convinced them yet), and do you have any suggestions for how we can write our code to avoid the data loss?
Writes on Windows, as on any modern operating system, are not actually sent to the disk until the OS gets around to it. This is a big performance win, but the problem (as you have found) is that you cannot detect errors at the time of the write.
Every operating system that does asynchronous writes also provides mechanisms for forcing data to disk. On Windows, the FlushFileBuffers or _commit function will do the trick. (One is for HANDLEs, the other for file descriptors.)
Note that you must check the return value of every disk write, and the return value of these synchronizing functions, in order to be certain the data made it to disk. Also note that these functions block and wait for the data to reach disk -- even if you are writing to a network server -- so they can be slow. Do not call them until you really need to push the data to stable storage.
For more, see fsync() Across Platforms.
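As a rough illustration of the advice above, here is a minimal C sketch that checks every step and forces the data out before reporting success. It assumes the application writes through stdio; write_record_durably and the sync_fd/file_no macros are hypothetical names introduced for this example. On Windows it uses _commit, elsewhere fsync.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <io.h>
#define sync_fd(fd) _commit(fd)   /* Windows: flush a file descriptor */
#define file_no(fp) _fileno(fp)
#else
#include <unistd.h>
#define sync_fd(fd) fsync(fd)     /* POSIX equivalent */
#define file_no(fp) fileno(fp)
#endif

/* Append len bytes to path. Returns 0 only if the write, the stdio flush,
 * the OS-level sync, and the close all succeeded; returns -1 otherwise. */
int write_record_durably(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);

    FILE *fp = fopen(path, "ab");
    if (fp == NULL)
        return -1;

    int ok = (fwrite(data, 1, len, fp) == len); /* check the write itself */
    ok = ok && (fflush(fp) == 0);               /* stdio buffer -> OS */
    ok = ok && (sync_fd(file_no(fp)) == 0);     /* OS cache -> storage */

    if (fclose(fp) != 0)                        /* close can fail too */
        ok = 0;

    return ok ? 0 : -1;
}
```

With a helper like this, the program would delete the source data only after the function returns 0. Because the sync call blocks until the data reaches the server, it may make sense to sync once per batch of records rather than once per record.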
You have a corrupted file system or a hard disk that is failing. The networking/server team should scan the disk to fix the former and detect the latter. Also check the error log to see if it tells you anything. If the error log indicates a failure to write to the hardware, then you need to replace the disk.

Resources