modprobe fails to insert beegfs after installing mellanox drivers - kernel-module

I have a storage cluster that has been churning along for a few years. It's based around a pretty stock CentOS 7.6 setup, using BeeGFS.
In an effort to increase throughput I've decided to do a test upgrade of the network, from 10gig to 40gig. However, it appears that the drivers needed for the 40gig card conflict with beegfs at the kernel-module level. Now that I have the 40gig network running successfully, beegfs-client fails to start:
modprobe: ERROR: could not insert 'beegfs': Unknown symbol in module, or unknown parameter (see dmesg)
How do I make these two get along?
The cards I've installed are all ConnectX-3 FDR InfiniBand (both ports configured as Ethernet, though). The driver I installed is MLNX_OFED_LINUX-5.0-2.1.8.0-rhel7.6-x86_64. Uninstalling the driver did not resolve the issue, although the 40gig network still works without it. The driver was only needed for reconfiguring the ports to Ethernet instead of InfiniBand.
Update: From the looks of it, I will need to add InfiniBand support in beegfs-client-autobuild.conf. I'm not entirely sure where to find the source that I need to reference.
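Since the modprobe error points at dmesg, a small helper can pull out exactly which symbols the module could not resolve. This is only a sketch: it assumes the kernel logs lines of the form "beegfs: Unknown symbol NAME (err N)", and the sample log lines and symbol names below are made up for illustration, not taken from the affected cluster.

```python
import re
import subprocess


def unknown_symbols(dmesg_text, module="beegfs"):
    """Return the unresolved symbol names logged for `module`, in order, without duplicates."""
    pattern = re.compile(rf"{re.escape(module)}: Unknown symbol (\S+)")
    seen = []
    for name in pattern.findall(dmesg_text):
        if name not in seen:
            seen.append(name)
    return seen


def read_dmesg():
    """Read the kernel ring buffer (usually needs root); not called automatically."""
    return subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout


# Hypothetical sample output, for illustration only.
SAMPLE = """\
[ 1234.5678] beegfs: Unknown symbol ib_create_qp (err 0)
[ 1234.5679] beegfs: Unknown symbol rdma_create_id (err 0)
[ 1234.5680] beegfs: Unknown symbol ib_create_qp (err 0)
[ 1234.5681] mlx4_core: Initializing 0000:03:00.0
"""

print(unknown_symbols(SAMPLE))
```

On a real node, `unknown_symbols(read_dmesg())` would list the symbols to look for; names starting with `ib_` or `rdma_` would suggest the client module was built against a different InfiniBand stack than the one currently loaded.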

Turns out the answer was simpler than anticipated: upgrade to the newest version of beegfs-client. The newer version includes infiniband compatibility by default. No rebuild needed.
After an upgrade and a reboot, the cluster behaved as intended again, with the Mellanox 40Gb/s cards operating.


Creating PyOpenCl context causes later access violation

I just started to experiment with OpenCL using pyopencl. I got it from here
http://www.lfd.uci.edu/~gohlke/pythonlibs/#pyopencl
I wrote some test programs and executing them worked as expected. Then, when I wanted to start a big batch of simulations, I got random crashes with access violations (Windows error code FFFFFFFFC0000005). It turns out that any script in which I import pyopencl and create a context crashes after one to two minutes. I ran 3 tests and got [63sec, 86sec, 81sec].
I have ensured that the context is always created on my 'Intel(R) HD Graphics 620' card by setting the environment variable PYOPENCL_CTX = 1:0
import pyopencl as cl
ctx = cl.create_some_context()
import time
i = 0
while True:
    print("Im alive since %i seconds" % i, flush = True)
    i += 1
    time.sleep(1.0)
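To rule out the interactive platform chooser and the environment variable as factors, the context can also be created explicitly. The following is a minimal sketch, assuming the value "1:0" means platform index 1 and device index 0; `parse_ctx_spec` and `make_context` are hypothetical helper names, and only the plain "platform:device" numeric form is handled.

```python
def parse_ctx_spec(spec):
    """Split a 'platform:device' string such as '1:0' into two integer indices."""
    platform_idx, device_idx = (int(part) for part in spec.split(":"))
    return platform_idx, device_idx


def make_context(spec="1:0"):
    """Create a context on one explicitly chosen device instead of create_some_context()."""
    import pyopencl as cl  # imported lazily so the parser works without pyopencl installed

    platform_idx, device_idx = parse_ctx_spec(spec)
    device = cl.get_platforms()[platform_idx].get_devices()[device_idx]
    return cl.Context([device])


print(parse_ctx_spec("1:0"))
```

If the crash still happens with `ctx = make_context("1:0")`, the selection mechanism is not the cause and the driver becomes the more likely suspect.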
I have a Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] from python.org
Edit:
After removing the environment variable and just letting the shell sit in the choose-platform dialog, without choosing a platform, it also crashes after about a minute with an access violation.
Edit:
I updated the driver to the latest available version, 22.20.16.4771 (from 13/08/2017), but the problem persists.
This may be due to an outdated driver. If the computer vendor does not provide up-to-date drivers, a generic driver from Intel may work, even though there is no guarantee of compatibility.
Intel only provides drivers as a self-installing .exe, which refuses to install when it notices that a vendor-customized version of their drivers is running.
However, one can circumvent this check by letting the self-extracting .exe unpack, copying the data from the temp folder it creates, and then manually installing these drivers using the Windows Device Manager. It goes without saying that this may easily break a setup.

WDK Driver load issue (The service cannot be started, either because it is disabled)

I have used Windows 8.1 to write many drivers with no issues whatsoever when loading them. There seems to be some sort of issue when I try to load a new basic KMDF driver that I built in Visual Studio. I am able to edit the source and compile new versions of driver projects created on previous versions of Windows, so I assume the WDK is the true culprit here. I can load drivers whose original project was generated on Windows 8.1, even if I edit the source and recompile. But if I create a new driver project through Visual Studio, namely the basic Kernel Mode Driver template, it fails to load with the error:
"The service cannot be started, either because it is disabled or because it has no enabled devices associated with it"
A couple of points:
The driver fails to load with the same error every time. I have my own trusted certificate from DigiCert and I have also tried disabling driver signature enforcement, both with the same error. So it is safe to say that certificates are not the issue.
The only main difference I can tell between the old and new WDK sources is that the old version has specific versions of Windows to build for, while the new one has "universal", although from the settings it looks like it will just build for Windows 10.
I am not making any stupid mistakes: I am compiling for x64, etc.
I'm starting to think that the WDK KMDF basic template may have some sort of issue with it.
I would rather not have to gut an old driver project to get a successful "new" driver to load.
Can you please specify whether it is a legacy driver or a PnP driver?
I faced a similar issue, and my mistake was compiling a PnP driver and trying to load it as a legacy driver.
For completeness, here is the difference: a PnP driver is one that comes with an AddDevice routine. Such drivers are expected to have a start type of 0 and are loaded at boot time. The driver needs to attach to a specific device object in the AddDevice routine.
Legacy drivers are the ones with no AddDevice routine, where IoCreateDevice is called from DriverEntry itself.

Vtune report Outside any known module

I am using Intel(R) VTune(TM) Amplifier XE 2013 Update 5 (build 274450) to collect hotspots for my Linux application, but the report says that "[Outside any known module]" consumes most of the time, so I want to get more information about this unknown module.
When I read the release notes of VTune Amplifier, they say "List of hotspots may contain "Outside any known module" on systems with kernel older than 2.6.20 (200233501)", but my Linux kernel is "2.6.32". Any idea about this?
Check that your program is not generating code on the fly (i.e. is not JIT-ing). You may also want to switch grouping to "Module / Code Location / Call stack" and see which virtual addresses cannot be mapped by VTune to any known module.
I have suffered from this issue in the past as well, and it is very frustrating if you don't know why it is happening.
Two weeks ago I installed Ubuntu 13.04 and VTune update 14, and I was jumping for joy because I could (again) see what happened inside my code.
After doing some updates on my Ubuntu, VTune started to show your problem.
I installed the kernel sources: no help.
I reinstalled the driver: no help.
I reinstalled Intel VTune: no help.
And then I decided to run as root, and what do you know, it works; no more 'Outside any known module'. I switched back to my regular user and it stopped working. I switched back to root and it worked again. So perhaps it is some kind of access issue.
Maybe you could try this.
You probably have kernel addresses hidden by kptr_restrict. You can check the value in "/proc/sys/kernel/kptr_restrict":
kptr_restrict = 0, kernel addresses are provided without limitations (recommended).
kptr_restrict = 1, addresses are provided if the current user has the CAP_SYSLOG capability.
kptr_restrict = 2, the kernel addresses are hidden regardless of the privileges the current user has.
You can use this option before running the trace:
sysctl -w kernel.kptr_restrict=0
More details here: https://software.intel.com/en-us/vtune-help-enabling-linux-kernel-analysis
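The check described above can also be scripted before starting a collection. This is a rough sketch: the three meanings are copied from the list above, while the function names are my own and the file read assumes a Linux system that exposes /proc/sys/kernel/kptr_restrict.

```python
from pathlib import Path

KPTR_RESTRICT_PATH = Path("/proc/sys/kernel/kptr_restrict")

# Meanings as listed in the answer above.
MEANINGS = {
    0: "kernel addresses are provided without limitations",
    1: "addresses are provided only if the user has CAP_SYSLOG",
    2: "kernel addresses are hidden regardless of privileges",
}


def describe_kptr_restrict(value):
    """Translate a kptr_restrict value into a human-readable explanation."""
    return MEANINGS.get(value, "unrecognised value")


def read_kptr_restrict(path=KPTR_RESTRICT_PATH):
    """Read the current setting; only works where the procfs entry exists."""
    return int(path.read_text().strip())


if KPTR_RESTRICT_PATH.exists():
    current = read_kptr_restrict()
    print(f"kptr_restrict = {current}: {describe_kptr_restrict(current)}")
else:
    print("kptr_restrict is not available on this system")
```

A value of 1 would be consistent with the observation above that profiling works as root but not as a regular user.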
Hope this helps!

Windows display driver hooking, 64 bit

I once wrote a sort of driver for Windows that had to intercept the interaction of the native display driver with the OS. The native display driver consists of a miniport driver and a DLL loaded by win32k.sys into the session space. My goal was to sit between win32k.sys and that DLL. Moreover, since the system might have several display drivers, I had to hook them all.
I created a standard WDM driver, which was configured to load at system boot (i.e. before win32k). During its initialization it hooked ZwSetSystemInformation by patching the SSDT. This function is called by the OS whenever it loads or unloads a DLL in the session space, which is exactly what I needed.
When ZwSetSystemInformation is invoked with the SystemLoadImage parameter, one of its parameters is a pointer to a SYSTEM_LOAD_IMAGE structure, whose ModuleBase is the address at which the module is mapped. I then analyzed the mapped image and patched its entry point with my function; the rest is straightforward.
Now I need to port this driver to 64-bit Windows. Needless to say, it's not a trivial task at all. So far I have found the following obstacles:
All drivers must be signed
PatchGuard
SSDT is not directly exported.
If I understand correctly, PatchGuard and driver signature verification may be turned off; the driver will be installed on a dedicated machine, so we may torture it any way we want.
There are also tricks to locate the SSDT, according to online sources.
However, I recently discovered that there is a function called PsSetLoadImageNotifyRoutine. It may simplify the task considerably and help avoid dirty tricks.
My questions are:
If I use PsSetLoadImageNotifyRoutine, will I receive notifications about DLLs loaded into the session space? The official documentation talks about "system space or user space", but does "system space" also include the session space?
Do I need to disable the PatchGuard if I'm going to patch the mapped DLL image after it was mapped?
Are there any more potential problems I didn't think about?
Are there any other ways to achieve what I want?
Thanks in advance.
Do I need to disable the PatchGuard if I'm going to patch the mapped DLL image after it was mapped?
To load any driver on x64 it must be signed. With admin rights you can disable PatchGuard, and I personally recommend using DSEO, a GUI application made for this. Alternatively, you can bypass PatchGuard by overwriting the MBR (or BIOS), although this is typically considered a bootkit, i.e. malware.

SNMPd: Cannot open /proc/bus/pci

I cross-compiled NET-SNMP 5.7.1 from sources to a PowerPC using ELDK-3.1.
When I try to load the snmpd daemon in my embedded board, I see the message:
# snmpd -f -Lo
pcilib: Cannot open /proc/bus/pci
pcilib: Cannot find any working access method.
Of course my PPC board has no PCI, and I wonder why Net-SNMP is looking for it.
I see this same message in more than one place (sourceforge, mail-archive, google-groups), but it has no answer at all. Another variant, with a few unhelpful responses, is at (archlinuxarm).
Can anybody please help me?
I'm assuming you're on a Linux target.
Net-SNMP's changelog lists "[PATCH 3057093]: allow linux to use libpci for creating useful ifDescr strings".
The configure script will search for an available libpci, and, having found one, will define
HAVE_PCI_LOOKUP_NAME and HAVE_PCI_PCI_H. To disable this code: after configuring, you can change those defines in include/net-snmp/net-snmp-config.h, then rebuild. The affected code is in agent/mibgroup/if-mib/data_access/interface_linux.c.
There's also a patch in this bug report: http://sourceforge.net/p/net-snmp/bugs/2449/
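The manual edit of net-snmp-config.h described above could be scripted roughly as follows. This is a sketch under assumptions: it expects the autoconf-style form "#define NAME 1" in the generated header and comments it out as "/* #undef NAME */"; the function name and the sample header text are illustrative and not taken from the Net-SNMP tree.

```python
import re

# The two macros named in the answer above.
PCI_MACROS = ("HAVE_PCI_LOOKUP_NAME", "HAVE_PCI_PCI_H")


def disable_pci_macros(header_text, macros=PCI_MACROS):
    """Replace '#define NAME ...' lines for the given macros with '/* #undef NAME */'."""
    for name in macros:
        pattern = re.compile(
            rf"^[ \t]*#[ \t]*define[ \t]+{name}\b.*$", re.MULTILINE
        )
        header_text = pattern.sub(f"/* #undef {name} */", header_text)
    return header_text


# Hypothetical excerpt of a generated config header, for illustration only.
SAMPLE_HEADER = (
    "#define HAVE_PCI_LOOKUP_NAME 1\n"
    "#define HAVE_PCI_PCI_H 1\n"
    "#define HAVE_UNISTD_H 1\n"
)

print(disable_pci_macros(SAMPLE_HEADER))
```

To apply it, read include/net-snmp/net-snmp-config.h after running configure, write the returned text back, and rebuild. Re-running configure regenerates the header, so the edit would have to be repeated.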
I resolved the issue while using the stock snmpd that comes with Raspbian.
In the /etc/snmp/snmpd.conf file I isolated the issue to the following line:
agentAddress udp:161,udp6:[::1]:161
If, instead of listening on all interfaces, I specify the IP address of the eth0 interface, i.e.:
agentAddress udp:10.0.1.5:161,udp6:[::1]:161
then snmpd starts fine.
My speculation is that the stock snmpd tries to enumerate all possible interfaces, including the PCI ones.
