Multi-GPU programming using CUDA on a NUMA machine

I am currently porting an algorithm to two GPUs. The hardware has the following setup:
Two CPUs in a NUMA system, so main memory is split across both NUMA nodes.
Each GPU is physically connected to one of the CPUs (each PCIe controller has one GPU).
I created two threads on the host to control the GPUs. Each thread is bound to a NUMA node, i.e. each of the two threads runs on one CPU socket. How can I determine the device number of the GPU so that I can select the directly connected GPU using cudaSetDevice()?

As I mentioned in the comments, this is a question of CPU-GPU affinity. Here is a bash script that I hacked together. I believe it will give useful results on RHEL/CentOS 6.x; it probably won't work properly on many older or other Linux distros. You can run the script like this:
./gpuaffinity > out.txt
You can then read out.txt in your program to determine which logical CPU cores correspond to which GPUs. For example, on a NUMA Sandy Bridge system with two 6-core processors and 4 GPUs, sample output might look like this:
0 03f
1 03f
2 fc0
3 fc0
This system has 4 GPUs, numbered 0 to 3. Each GPU number is followed by a "core mask". The core mask corresponds to the cores which are "close" to that particular GPU, expressed as a hexadecimal bitmask. So for GPUs 0 and 1, the first 6 logical cores in the system (mask 03f) are closest. For GPUs 2 and 3, the second 6 logical cores in the system (mask fc0) are closest.
You can either read the file in your program, or else you can use the logic illustrated in the script to perform the same functions in your program.
You can also invoke the script like this:
./gpuaffinity -v
which will give slightly more verbose output.
Here is the bash script:
#!/bin/bash
# this script will output a listing of each GPU and its CPU core affinity mask
file="/proc/driver/nvidia/gpus/0/information"
if [ ! -e $file ]; then
  echo "Unable to locate any GPUs!"
else
  gpu_num=0
  file="/proc/driver/nvidia/gpus/$gpu_num/information"
  if [ "-v" == "$1" ]; then echo "GPU: CPU CORE AFFINITY MASK: PCI:"; fi
  while [ -e $file ]
  do
    line=`grep "Bus Location" $file | { read line; echo $line; }`
    pcibdf=${line:14}
    pcibd=${line:14:7}
    file2="/sys/class/pci_bus/$pcibd/cpuaffinity"
    read line2 < $file2
    if [ "-v" == "$1" ]; then
      echo " $gpu_num $line2 $pcibdf"
    else
      echo " $gpu_num $line2 "
    fi
    gpu_num=`expr $gpu_num + 1`
    file="/proc/driver/nvidia/gpus/$gpu_num/information"
  done
fi
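If you'd rather not parse the script's output file, the same mapping can be derived directly in the host code: cudaDeviceGetPCIBusId() returns each device's PCI bus id, and the kernel exposes the owning NUMA node under /sys/bus/pci/devices/<busid>/numa_node. Below is a minimal C sketch along those lines (the helper name select_gpu_on_numa_node is mine, error handling is abbreviated, and numa_node may read as -1 on machines whose firmware does not report locality):

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Pick the first GPU attached to the given NUMA node and make it current.
   Returns the selected device id, or -1 if none matched. */
static int select_gpu_on_numa_node(int wanted_node)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    for (int dev = 0; dev < count; ++dev) {
        char busid[32];
        if (cudaDeviceGetPCIBusId(busid, sizeof(busid), dev) != cudaSuccess)
            continue;
        for (char *p = busid; *p; ++p)      /* sysfs paths use lower-case hex */
            *p = (char)tolower((unsigned char)*p);

        char path[128];
        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", busid);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        int node = -1;
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);

        if (node == wanted_node) {
            cudaSetDevice(dev);             /* bind the calling thread to that GPU */
            return dev;
        }
    }
    return -1;
}

Each of the two NUMA-bound host threads would call this with its own node number before doing any other CUDA work.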

The nvidia-smi tool can report the topology on a NUMA machine.
% nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X PHB SOC SOC 0-5
GPU1 PHB X SOC SOC 0-5
GPU2 SOC SOC X PHB 6-11
GPU3 SOC SOC PHB X 6-11
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
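If the NVIDIA Management Library (NVML, the library behind nvidia-smi) is installed, the same affinity information can also be queried programmatically, and nvmlDeviceSetCpuAffinity() pins the calling process to the CPUs local to a given GPU. A small sketch, assuming the NVML header and -lnvidia-ml are available (error handling abbreviated):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS)
        return 1;

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;

        unsigned long mask[4] = { 0 };      /* room for 256 logical CPUs */
        if (nvmlDeviceGetCpuAffinity(dev, 4, mask) == NVML_SUCCESS)
            printf("GPU %u: CPU affinity mask (low word) = %lx\n", i, mask[0]);

        /* alternatively: nvmlDeviceSetCpuAffinity(dev); to pin the caller */
    }

    nvmlShutdown();
    return 0;
}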


How does lspci find out physical slot number of a PCI(E) device?

lspci is capable of showing the physical slot number in its verbose output.
I'd like to find out how it does that. I am going to apply this method in a driver that I would like to modify, so that it enumerates devices (with the same ID) and disambiguates the device files according to physical slot, e.g. /dev/device_physslot. The driver will run on Ubuntu 18.
I tried to dig into the source code. I found the relevant line 775 in https://github.com/pciutils/pciutils/blob/master/lspci.c:
if (p->phy_slot)
printf("\tPhysical Slot: %s\n", p->phy_slot);
p is a struct pci_dev. That was quite confusing, because the standard linux/pci.h has no phy_slot field, until I figured out that pciutils uses its own (re)definition.
The structure is filled by the function
int
pci_fill_info_v38(struct pci_dev *d, int flags)
{
  unsigned int uflags = flags;
  if (uflags & PCI_FILL_RESCAN)
    {
      uflags &= ~PCI_FILL_RESCAN;
      pci_reset_properties(d);
    }
  if (uflags & ~d->known_fields)
    d->methods->fill_info(d, uflags);
  return d->known_fields;
}
fill_info is a function pointer defined in https://github.com/pciutils/pciutils/blob/master/lib/internal.h (line 44)
And that's where I lost track.
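For what it's worth, from userspace the same field is reachable through pciutils' public libpci API without going through lspci itself. A minimal sketch, assuming the libpci development headers and -lpci are available (this won't help inside a kernel module, but it shows where the value surfaces):

#include <stdio.h>
#include <pci/pci.h>

int main(void)
{
    struct pci_access *pacc = pci_alloc();
    pci_init(pacc);
    pci_scan_bus(pacc);

    for (struct pci_dev *dev = pacc->devices; dev; dev = dev->next) {
        pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_PHYS_SLOT);
        if (dev->phy_slot)   /* NULL when the platform reports no slot for the device */
            printf("%04x:%02x:%02x.%d  slot %s\n",
                   dev->domain, dev->bus, dev->dev, dev->func, dev->phy_slot);
    }

    pci_cleanup(pacc);
    return 0;
}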
If you run dmidecode it will show the stored platform information, which will tell you the mapping of physical slots to PCIe addresses. For example:
Handle 0x001D, DMI type 9, 17 bytes
System Slot Information
Designation: J6B1
Type: x1 PCI Express
Current Usage: In Use
Length: Short
ID: 1 <== SLOT (starting at 0)
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:00:1c.3 <== PCI BUS ADDRESS
Programmatic list of slots:
sudo dmidecode -t 9 |awk '/ID:/ {id=$2} /Bus Address/ {print "Slot",id+1,"PCIe",$3}'
Slot 1 PCIe 0000:00:01.0
Slot 2 PCIe 0000:00:1c.3
Slot 3 PCIe 0000:00:1c.4
Slot 4 PCIe 0000:00:1c.5
Slot 5 PCIe 0000:00:1c.6
After digging deeper into the source and managing to run lspci in a debugger (thanks to NetBeans), I found out that lspci uses sysfs to gather this information.
In particular, the /sys/bus/pci/slots/<slot_num>/address file contains the bus address of the slot, and that is what lspci uses to map slots to bus addresses in the function sysfs_fill_slots (in sysfs.c).
Unfortunately this method turns out to be unsuitable for my purposes, since it is not possible to perform file I/O from a kernel module.
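From userspace, though, the same slot-to-address mapping that lspci reads can be read directly. Here is a minimal C sketch, assuming the /sys/bus/pci/slots layout described above (the address file typically holds domain:bus:device without the function number):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *d = opendir("/sys/bus/pci/slots");
    if (!d) {
        perror("/sys/bus/pci/slots");
        return 1;
    }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                      /* skip . and .. */

        char path[512], addr[64] = "";
        snprintf(path, sizeof(path), "/sys/bus/pci/slots/%s/address", e->d_name);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;                      /* not every entry has an address file */
        if (fgets(addr, sizeof(addr), f))
            addr[strcspn(addr, "\n")] = '\0';
        fclose(f);

        printf("slot %s -> %s\n", e->d_name, addr);
    }

    closedir(d);
    return 0;
}

Inside a kernel module, the closer analogue is probably the pci_dev->slot pointer (a struct pci_slot), though I have not verified that it is populated on every platform.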

emulating the reMarkable tablet (i.MX6 ARMv7) with Qemu

I'm trying to emulate the reMarkable tablet with Qemu in order to create a proper development environment for it, instead of cross-compiling and sending to the hardware device.
The firmware flasher repo contains the rootfs, kernel, DTB and u-boot files. I've created an .img file from the rootfs in order to boot it in Qemu with the following command:
qemu-system-arm \
-M sabrelite \
-bios "files/u-boot.imx" \
-kernel "zImage" \
-append "console=ttymxc0 rootfstype=ext4 root=/dev/mmcblk1p2 rw rootwait init=/bin/bash loglevel=8 bootmem-debug earlyprintk" \
-dtb "zero-gravitas.dtb" \
-drive file="floppy.img",format=raw,id=mmcblk1p2 \
-device sd-card,drive=mmcblk1p2
but the kernel does not seem to be starting, as I get the same log whether the floppy.img (drive + device) is given or not. The startup loops on this error:
[ 0.713093] 2020000.serial: ttymxc0 at MMIO 0x2020000 (irq = 19, base_baud = 5000000) is a IMX
[ 0.732268] console [ttymxc0] enabled
[ 0.736333] phy index low: 1, phy index high: 2
[ 240.289647] INFO: task swapper:1 blocked for more than 120 seconds.
[ 240.290160] Not tainted 4.1.28-zero-gravitas-01866-ge0b823726ea4-dirty #82
[ 240.290318] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.290662] swapper D 8051c44c 0 1 0 0x00000000
[ 240.292245] [<8051c44c>] (__schedule) from [<8051c73c>] (schedule+0x40/0x98)
[ 240.292473] [<8051c73c>] (schedule) from [<8051e7b8>] (schedule_timeout+0x114/0x168)
[ 240.292781] [<8051e7b8>] (schedule_timeout) from [<8051d248>] (wait_for_common+0x88/0x130)
[ 240.292953] [<8051d248>] (wait_for_common) from [<80262c74>] (imx_rng_init+0x158/0x2a8)
[ 240.293117] [<80262c74>] (imx_rng_init) from [<80262574>] (set_current_rng+0xc0/0x15c)
[ 240.293276] [<80262574>] (set_current_rng) from [<80262874>] (hwrng_register+0x190/0x1b8)
[ 240.293436] [<80262874>] (hwrng_register) from [<807c3fd8>] (imx_rng_probe+0xd4/0x134)
[ 240.293682] [<807c3fd8>] (imx_rng_probe) from [<802748e0>] (platform_drv_probe+0x44/0xac)
[ 240.293852] [<802748e0>] (platform_drv_probe) from [<802735ac>] (driver_probe_device+0x178/0x2b8)
[ 240.294009] [<802735ac>] (driver_probe_device) from [<802737bc>] (__driver_attach+0x8c/0x90)
[ 240.294158] [<802737bc>] (__driver_attach) from [<80271d50>] (bus_for_each_dev+0x68/0x9c)
[ 240.294352] [<80271d50>] (bus_for_each_dev) from [<802726bc>] (bus_add_driver+0x13c/0x1e4)
[ 240.294600] [<802726bc>] (bus_add_driver) from [<80273ed4>] (driver_register+0x78/0xf8)
[ 240.294843] [<80273ed4>] (driver_register) from [<807c434c>] (__platform_driver_probe+0x20/0x70)
[ 240.295092] [<807c434c>] (__platform_driver_probe) from [<807a9d78>] (do_one_initcall+0x118/0x1c4)
[ 240.295367] [<807a9d78>] (do_one_initcall) from [<807a9f48>] (kernel_init_freeable+0x124/0x1c4)
[ 240.295609] [<807a9f48>] (kernel_init_freeable) from [<8051883c>] (kernel_init+0x8/0xe8)
[ 240.295844] [<8051883c>] (kernel_init) from [<8000ef88>] (ret_from_fork+0x14/0x2c)
full log here
I will update this question when I have new findings, but I'm new to QEMU, quite stuck, and have run out of options. The repository I'm working in is here. Thanks for any input!
I haven't investigated too closely, but the fact that the backtrace shows a hang in the imx_rng_init function suggests that the problem is that QEMU doesn't have an emulation of the imx SoC's builtin RNG device, and so the guest is hanging forever waiting for a response from hardware that doesn't exist.
You'll need to either implement a model of that device, or else use a guest kernel which doesn't try to probe for that device.
More generally, running an Arm kernel that's intended for one piece of hardware on a different piece of hardware will usually not work. The sabrelite has the same SoC here, so booting works better than it would if you tried to do it with an entirely unrelated QEMU machine, but if at any time your guest code tries to access hardware outside the SoC which is specific to the reMarkable then you will find it doesn't work. If you really need to get the stock kernel for the hardware to boot you will probably at some point need to bite the bullet and implement a proper machine model of it in QEMU with the relevant devices present.
If you don't actually need to run anything on the guest that cares about the specific differences between one imx6 system and another, you might be able to get away with using a kernel and DTB for the sabrelite board plus the rootfs from the reMarkable.

Can I easily compile u-boot with more commands for ARM Versatile PB?

I have compiled u-boot from the u-boot-2013.01.y branch for the versatilepb board (ARM), and I need the fatload command, which is not present in this configuration.
I'm running u-boot under qemu
DRAM: 128 MiB
WARNING: Caches not enabled
Using default environment
In: serial
Out: serial
Err: serial
Net: SMC91111-0
Warning: SMC91111-0 using MAC address from net device
VersatilePB # fat
Unknown command 'fat' - try 'help'
VersatilePB # help
? - alias for 'help'
base - print or set address offset
bdinfo - print Board Info structure
bootm - boot application image from memory
bootp - boot image via network using BOOTP/TFTP protocol
cmp - memory compare
cp - memory copy
crc32 - checksum calculation
dhcp - boot image via network using DHCP/TFTP protocol
env - environment handling commands
erase - erase FLASH memory
flinfo - print FLASH memory information
go - start application at address 'addr'
help - print command description/usage
iminfo - print header information for application image
loop - infinite loop on address range
md - memory display
mm - memory modify (auto-incrementing address)
mtest - simple RAM read/write test
mw - memory write (fill)
nm - memory modify (constant address)
ping - send ICMP ECHO_REQUEST to network host
printenv- print environment variables
protect - enable or disable FLASH write protection
reset - Perform RESET of the CPU
setenv - set environment variables
tftpboot- boot image via network using TFTP protocol
version - print monitor, compiler and linker version
VersatilePB #
I need fatload to load a file from a FAT filesystem image that contains a FreeBSD kernel. Can I somehow change the build configuration for that board so that u-boot is compiled with the fatload command, or is it just not possible/not supported for that board?
Having done more or less exactly this for a Versatile AB, it's most certainly possible. The simplest way is to find where that board's command set is defined, and hack in the commands you want by defining the relevant CONFIG_CMD_* symbols. In this case, that place is include/configs/versatile.h.
Looking at my checkout of 2015.07, I seem to have added, among others (I think I was trying to convince the MMC to work at the time), these lines:
#define CONFIG_CMD_FAT
#define CONFIG_DOS_PARTITION 1
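For the original question (FAT support on versatilepb), a minimal set of additions to include/configs/versatile.h might look like the following. This is only a sketch against the 2013.01-era configuration style; which extra symbols you actually need depends on where the FAT image lives, so treat the commented-out defines as suggestions to verify against your tree:

#define CONFIG_CMD_FAT          /* fatload, fatls, fatinfo */
#define CONFIG_DOS_PARTITION 1  /* DOS/MBR partition table parsing */
/* possibly also, depending on the storage backing the FAT image: */
/* #define CONFIG_CMD_MMC */
/* #define CONFIG_CMD_EXT2 */

After editing the header, reconfigure and rebuild (e.g. make versatilepb_config && make) and the fat* commands should show up in help.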

How can I identify the protocol used in hard disk?

I have an application which needs to read information from a hard disk, stuff like serial model etc.
Now of course it matters if the drive is a SAS, SATA or FC drive.
Is there a reliable way that I can identify which protocol a connected drive uses? Either via an OS command or checking some logs or inquiring the device?
I don't want to use the sysfs structure. I want to know how the OS knows whether it's an ATA, SCSI, or some other type of disk.
As you mentioned in the comments to user3588161's answer, you have both SATA and SAS disks attached to the same SAS controller, so I'd suggest using the smartctl command!
The smartctl command acts as a control and monitoring utility for SMART disks under Linux and Unix-like operating systems. Type the following command to get information about /dev/sda (a SATA disk):
# smartctl -d ata -a -i /dev/sda
For a SAS disk, use one of the following forms:
# smartctl -d scsi --all /dev/sgX
# smartctl -d scsi --all /dev/sg1
# smartctl -d scsi --all /dev/sg1 -H
I guess all of this information is somehow related to this location:
/sys/class/scsi_device/?:?:?:?/device/model
I suggest you try this too, to check what output it renders.
cat /sys/class/scsi_device/0\:0\:0\:0/device/{model,vendor}
(The backslashes next to the zeros escape the special character :.)
Also, I'd suggest visiting these two links for more information and details such as sample output:
Find Out Hard Disk Specs
To Check Disk behind Adaptec RAID Controllers
Checking boot information, it seems the disk type is set in kernel ahci calls. You can check (as root) with dmesg | grep ahci (on sysvinit systems) or with journalctl -k -b -0 -l --no-pager | grep ahci (with systemd). The relevant query/setting looks to be:
kernel: ahci 0000:00:12.0: version 3.0
kernel: ahci 0000:00:12.0: controller can't do 64bit DMA, forcing 32bit
kernel: ahci 0000:00:12.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
kernel: ahci 0000:00:12.0: flags: ncq sntf ilck pm led clo pmp pio slum part ccc
The third line holds the controller/type information you are looking for. This seems to be where the information comes from but, from your question's standpoint, it isn't a viable solution.
The question becomes where this information gets recorded or stored within /dev, /proc, or /sys. I have looked and cannot find a one-to-one correlation between this initial determination of disk type at boot and any stored flag. The information may well be part of the encoded data under, for example, /sys/class/scsi_disk/0:0:0:0/device or a similar location. Hopefully this will allow you or others to pinpoint whether, and if so where, this information is captured and available on a running system.
Answer rewritten in view of the clarification: libATA is what you want. It's what hdparm calls, and it reports the transport too. It's hard to find up-to-date docs on it, though. See http://docs.huihoo.com/linux/kernel/2.6.26/libata/index.html for example.
I have not used libATA (directly) myself, so I can't be more specific as to the API calls needed. Since not many people need to write something like hdparm themselves, your best bet is to consult its sources to see what exactly it calls.
hdparm can report stuff like:
[root@alarmpi ~]# hdparm -I /dev/sdb
/dev/sdb:
ATA device, with non-removable media
Model Number: TOSHIBA DT01ACA200
Serial Number: Z36GKMKGS
Firmware Revision: MX4OABB0
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0; Revision: ATA8-AST T13 Project D1697 Revision 0b
If your actual problem is that only sdparm works on your system for SCSI drives (which can happen), then the problem is reduced to figuring out which of hdparm or sdparm to call, isn't it? You could use udevinfo for that. See https://chromium.googlesource.com/chromiumos/third_party/laptop-mode-tools/+/775acea9e819bdee90cca8d2363827c13967a14b/laptop-mode-tools_1.52/usr/share/laptop-mode-tools/modules/hdparm for example.
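If you want to inquire the device yourself rather than shell out to hdparm/sdparm, a plain SCSI INQUIRY via the SG_IO ioctl works on both SAS and SATA drives (SATA drives answer through the kernel's SCSI-ATA translation layer). A minimal C sketch, with error handling abbreviated and /dev/sda used as an example device:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char cdb[6]  = { 0x12, 0, 0, 0, 96, 0 };   /* INQUIRY, 96-byte response */
    unsigned char buf[96] = { 0 };
    unsigned char sense[32];

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.cmd_len         = sizeof(cdb);
    io.cmdp            = cdb;
    io.dxferp          = buf;
    io.dxfer_len       = sizeof(buf);
    io.sbp             = sense;
    io.mx_sb_len       = sizeof(sense);
    io.timeout         = 5000;                          /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0) { perror("SG_IO"); return 1; }

    printf("vendor:  %.8s\n",  (char *)buf + 8);        /* bytes 8-15  */
    printf("product: %.16s\n", (char *)buf + 16);       /* bytes 16-31 */
    printf("rev:     %.4s\n",  (char *)buf + 32);       /* bytes 32-35 */

    close(fd);
    return 0;
}

A SATA disk behind libata typically reports the vendor string "ATA" here, which is itself a strong hint about the transport; a native SAS drive reports its real vendor.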

Running openmp on cluster

I have to run an OpenMP program on a cluster with different configurations (such as different numbers of nodes).
The problem I am facing is that whenever I try to run the program with, say, 2 nodes, the same piece of the program runs 2 times instead of running in parallel.
My program -
gettimeofday(&t0, NULL);
for (k = 0; k < size; k++) {
    #pragma omp parallel for shared(A)
    for (i = k+1; i < size; i++) {
        //parallel code
    }
    #pragma omp barrier
    for (i = k+1; i < size; i++) {
        #pragma omp parallel for
        //parallel code
    }
}
gettimeofday(&t1, NULL);
printf("Did %u calls in %.2g seconds\n", i, t1.tv_sec - t0.tv_sec + 1E-6 * (t1.tv_usec - t0.tv_usec));
It is an LU decomposition program.
When I run it on 2 nodes, I get output something like this:
Did 1000 calls in 5.2 seconds
Did 1000 calls in 5.3 seconds
Did 2000 calls in 41 seconds
Did 2000 calls in 41 seconds
As you can see, the program is run two times for each value (1000, 2000, 3000, ...) instead of running in parallel.
This is my homework program, but I am stuck at this point.
I am using a SLURM script to run this program on my college computing cluster. This is the standard script provided by the professor.
#!/bin/sh
##SBATCH --partition=general-compute
#SBATCH --time=60:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
##SBATCH --mem=24000
# Memory per node specification is in MB. It is optional.
# The default limit is 3GB per core.
#SBATCH --job-name="lu_openmpnew2nodes"
#SBATCH --output=luopenmpnew1node2task.out
#SBATCH --mail-user=***#***.edu
#SBATCH --mail-type=ALL
##SBATCH --requeue
#Specifies that the job will be requeued after a node failure.
#The default is that the job will not be requeued.
echo "SLURM_JOBID="$SLURM_JOBID
echo "SLURM_JOB_NODELIST"=$SLURM_JOB_NODELIST
echo "SLURM_NNODES"=$SLURM_NNODES
echo "SLURMTMPDIR="$SLURMTMPDIR
cd $SLURM_SUBMIT_DIR
echo "working directory = "$SLURM_SUBMIT_DIR
module list
ulimit -s unlimited
#
echo "Launch luopenmp with srun"
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
for i in {1000..20000..1000}
do
srun ./openmpNew "$i"
done
#
echo "All Done!"
Be careful, you are confusing MPI and OpenMP here.
OpenMP works with threads, i.e. in shared memory; threads do not communicate across the nodes of a distributed-memory system (there are techniques to do so, but they are not performant enough).
What you are doing is starting the same program once on each of the two nodes. If you were using MPI, this would be fine. But in your case you start two processes, each with a default number of threads, and those two processes are independent of each other.
I would suggest some further study of shared-memory parallelization (like OpenMP) and distributed-memory parallelization (like MPI). There are tons of tutorials out there, and I would recommend the book "Introduction to High Performance Computing for Scientists and Engineers" by Hager and Wellein.
To try your program, start on one node, and specify OMP_NUM_THREADS like:
OMP_NUM_THREADS=1 ./openmpNew "$i"
OMP_NUM_THREADS=2 ./openmpNew "$i"
...
Here is an example script for SLURM: link.
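To see the difference concretely, here is a tiny OpenMP test program (a sketch; the file name hello_omp.c and the launch lines are illustrative). Built with gcc -fopenmp hello_omp.c -o hello_omp, running it as OMP_NUM_THREADS=4 srun -n 1 ./hello_omp prints one set of four threads, while srun -n 2 ./hello_omp prints two complete, independent sets, which is exactly the duplicated output described in the question:

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
    char host[256];
    gethostname(host, sizeof(host));          /* which node this process landed on */

    #pragma omp parallel
    {
        /* every OpenMP thread of this single process prints one line */
        #pragma omp critical
        printf("host %s: thread %d of %d\n",
               host, omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}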
