RoCE connection problem with MLNX_OFED (RDMA over Converged Ethernet)

I am trying to get RoCE (RDMA over Converged Ethernet) working between two workstations. I have installed MLNX_OFED on both computers, which are equipped with Mellanox ConnectX-5 EN 100GbE adapters and connected directly to each other with the corresponding cable. According to what I've read, I need a Subnet Manager running on one of the workstations to be able to use RoCE between them.
When I try to run the command opensm, it says it finds no local ports. Pinging between the two computers works, and the udaddy RDMA test also succeeds. But the RDMA_RC_EXAMPLE presented in this guide https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf fails when creating the Queue Pairs, more specifically when it tries to transition the QP state to RTR (ready to receive).
Also, some sources say that you need an RDMA service, which does not exist on my computer. And the installation of MLNX_OFED puts an exclude of ibutils-libs* in /etc/yum.conf; I don't know if that's relevant, but I noticed it.
I am running CentOS 7.7 on one of the machines and CentOS 7.8 on the other.
I'm a bit puzzled over what's faulty.
Update
Here's the function that breaks when running the code.
/******************************************************************************
* Function: modify_qp_to_rtr
*
* Input
* qp QP to transition
* remote_qpn remote QP number
* dlid destination LID
* dgid destination GID (mandatory for RoCEE)
*
* Output
* none
*
* Returns
* 0 on success, ibv_modify_qp failure code on failure
*
* Description
* Transition a QP from the INIT to RTR state, using the specified QP number
******************************************************************************/
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_256;
    attr.dest_qp_num = remote_qpn;
    attr.rq_psn = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 0x12;
    attr.ah_attr.is_global = 0;
    attr.ah_attr.dlid = dlid;
    attr.ah_attr.sl = 0;
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num = config.ib_port;
    if (config.gid_idx >= 0)
    {
        attr.ah_attr.is_global = 1;
        attr.ah_attr.port_num = 1;
        memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
        attr.ah_attr.grh.flow_label = 0;
        attr.ah_attr.grh.hop_limit = 1;
        attr.ah_attr.grh.sgid_index = config.gid_idx;
        attr.ah_attr.grh.traffic_class = 0;
    }
    flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
            IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTR\n");
    return rc;
}
This source, RoCE Debug Flow for Linux, says that
RoCE requires an MTU of at least 1024 bytes for net payload.
which I guess would affect this line in the code:
attr.path_mtu = IBV_MTU_256;
However, when changing it to IBV_MTU_1024, the code compiles but gives the same error.

Solved
When using RoCE you need to specify the GID index, as explained by #haggai_e. I don't think this is needed when using InfiniBand instead of RoCE.
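To find a valid index to pass as config.gid_idx, you can list the GIDs configured on the port. A minimal sketch of my own (not from the original example), assuming an open device context ctx from ibv_open_device; if your MLNX_OFED install provides the show_gids script, it prints a similar table:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Print every GID entry on the given port so a valid index can be chosen. */
static void print_gids(struct ibv_context *ctx, uint8_t port)
{
    struct ibv_port_attr pattr;

    if (ibv_query_port(ctx, port, &pattr))
        return;
    for (int i = 0; i < pattr.gid_tbl_len; i++) {
        union ibv_gid gid;

        if (ibv_query_gid(ctx, port, i, &gid))
            continue;
        printf("gid[%d]: ", i);
        for (int b = 0; b < 16; b += 2)
            printf("%02x%02x%s", gid.raw[b], gid.raw[b + 1], b < 14 ? ":" : "\n");
    }
}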
When running the qperf RDMA test we also needed to set the connection-manager flag for it to run. I would assume it plays the same role as the Subnet Manager that is needed when using InfiniBand instead of RoCE.
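For reference, a sketch of such a qperf invocation; the exact flag name (-cm1, enabling the RDMA connection manager) is from memory, so check qperf --help on your system:

server$ qperf
client$ qperf -cm1 <server-ip> rc_bw rc_lat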

Related

Proper use of `nalu_process` callback in x264

I wish to make use of libx264's low-latency encoding mechanism, whereby a user-provided callback is called as soon as a single NAL unit is available, instead of having to wait for a whole frame to be encoded before processing can start.
The x264 documentation states the following about that facility:
/* Optional low-level callback for low-latency encoding. Called for each output NAL unit
* immediately after the NAL unit is finished encoding. This allows the calling application
* to begin processing video data (e.g. by sending packets over a network) before the frame
* is done encoding.
*
* This callback MUST do the following in order to work correctly:
* 1) Have available an output buffer of at least size nal->i_payload*3/2 + 5 + 64.
* 2) Call x264_nal_encode( h, dst, nal ), where dst is the output buffer.
* After these steps, the content of nal is valid and can be used in the same way as if
* the NAL unit were output by x264_encoder_encode.
*
* This does not need to be synchronous with the encoding process: the data pointed to
* by nal (both before and after x264_nal_encode) will remain valid until the next
* x264_encoder_encode call. The callback must be re-entrant.
*
* This callback does not work with frame-based threads; threads must be disabled
* or sliced-threads enabled. This callback also does not work as one would expect
* with HRD -- since the buffering period SEI cannot be calculated until the frame
* is finished encoding, it will not be sent via this callback.
*
* Note also that the NALs are not necessarily returned in order when sliced threads is
* enabled. Accordingly, the variable i_first_mb and i_last_mb are available in
* x264_nal_t to help the calling application reorder the slices if necessary.
*
* When this callback is enabled, x264_encoder_encode does not return valid NALs;
* the calling application is expected to acquire all output NALs through the callback.
*
* It is generally sensible to combine this callback with a use of slice-max-mbs or
* slice-max-size.
*
* The opaque pointer is the opaque pointer from the input frame associated with this
* NAL unit. This helps distinguish between nalu_process calls from different sources,
* e.g. if doing multiple encodes in one process.
*/
void (*nalu_process)( x264_t *h, x264_nal_t *nal, void *opaque );
This seems straightforward enough. However, when I run the following dummy code, I get a segfault on the marked line. I've tried to add some debugging to x264_nal_encode itself to understand where it goes wrong, but it seems to be the function call itself that results in a segfault. Am I missing something here? (Let's ignore the fact that the use of assert probably makes cb non-reentrant; it's only there to indicate to the reader that my workspace buffer is more than large enough.)
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x264.h>

#define WS_SIZE 10000000

uint8_t * workspace;

void cb(x264_t * h, x264_nal_t * nal, void * opaque)
{
    assert((nal->i_payload*3)/2 + 5 + 64 < WS_SIZE);
    x264_nal_encode(h, workspace, nal); // Segfault here.
    // Removed: Process nal.
}

int main(int argc, char ** argv)
{
    uint8_t * fake_frame = malloc(1280*720*3);
    memset(fake_frame, 0, 1280*720*3);
    workspace = malloc(WS_SIZE);

    x264_param_t param;
    int status = x264_param_default_preset(&param, "ultrafast", "zerolatency");
    assert(status == 0);
    param.i_csp = X264_CSP_RGB;
    param.i_width = 1280;
    param.i_height = 720;
    param.i_threads = 1;
    param.i_lookahead_threads = 1;
    param.i_frame_total = 0;
    param.i_fps_num = 30;
    param.i_fps_den = 1;
    param.i_slice_max_size = 1024;
    param.b_annexb = 1;
    param.nalu_process = cb;
    status = x264_param_apply_profile(&param, "high444");
    assert(status == 0);

    x264_t * h = x264_encoder_open(&param);
    assert(h);
    x264_picture_t pic;
    status = x264_picture_alloc(&pic, param.i_csp, param.i_width, param.i_height);
    assert(pic.img.i_plane == 1);
    x264_picture_t pic_out;
    x264_nal_t * nal; // Not used. We process NALs in cb.
    int i_nal;
    for (int i = 0; i < 100; ++i)
    {
        pic.i_pts = i;
        pic.img.plane[0] = fake_frame;
        status = x264_encoder_encode(h, &nal, &i_nal, &pic, &pic_out);
    }
    x264_encoder_close(h);
    x264_picture_clean(&pic);
    free(workspace);
    free(fake_frame);
    return 0;
}
Edit: The segfault happens the first time cb calls x264_nal_encode. If I switch to a different preset, where more frames are encoded before the first callback happens, then several successful calls to x264_encoder_encode are made before the first callback, and hence segfault, occurs.
After discussions with x264 developers on IRC, it seems that the behavior I was seeing is, in fact, a bug in x264. The x264_t * h passed to the callback is incorrect. If one overrides that handle with the good one (the one obtained from x264_encoder_open), things work fine.
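A sketch of that workaround (my illustration of the fix described above, not code from the x264 developers): keep the handle returned by x264_encoder_open in a global and use it in place of the callback's argument.

static x264_t * encoder; // set right after x264_encoder_open() succeeds

void cb(x264_t * h, x264_nal_t * nal, void * opaque)
{
    (void) h; // incorrect handle in affected x264 versions
    assert((nal->i_payload*3)/2 + 5 + 64 < WS_SIZE);
    x264_nal_encode(encoder, workspace, nal);
}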
I identified x264 git commit 71ed44c7312438fac7c5c5301e45522e57127db4 as the first bad one. The bug is documented as this x264 issue.
Update for future readers: I believe this issue has been fixed in commit 544c61f082194728d0391fb280a6e138ba320a96.

Cannot open USB device with libusb-1.0 in cygwin

I'm trying to interface with a USB peripheral using libusb-1.0 in cygwin.
libusb_get_device_list(...) works fine; I get a list of USB devices. The device with the correct VendorID and ProductID is present in that list, but when libusb_open(...) is called on it, the call always fails with the error code LIBUSB_ERROR_NOT_FOUND.
I don't think it's a permission issue: I've tried running this as admin, and there's a separate error code (LIBUSB_ERROR_ACCESS) for that. The same code works with libusb-1.0 on Linux.
unsigned init_usb(int vendor_id, int product_id, int interface_num)
{
    int ret = libusb_init(NULL);
    if (ret < 0) return CONTROL_ERROR;

    libusb_device **devs = NULL;
    int num_dev = libusb_get_device_list(NULL, &devs);
    libusb_device *dev = NULL;
    for (int i = 0; i < num_dev; i++) {
        struct libusb_device_descriptor desc;
        libusb_get_device_descriptor(devs[i], &desc);
        if (desc.idVendor == vendor_id && desc.idProduct == product_id) {
            dev = devs[i];
            break;
        }
    }
    if (dev == NULL) return CONTROL_ERROR;

    libusb_device_handle *devh = NULL;
    ret = libusb_open(dev, &devh);
    //ret is always -5 here (in cygwin)!
    if (ret < 0) return CONTROL_ERROR;

    libusb_free_device_list(devs, 1);
    return CONTROL_SUCCESS;
}
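As an aside, libusb (1.0.16 and newer, if I remember correctly) can turn that -5 into a readable message, which makes debugging output clearer; a sketch of the open call with that added:

ret = libusb_open(dev, &devh);
if (ret < 0) {
    fprintf(stderr, "libusb_open failed: %s\n", libusb_strerror(ret));
    libusb_free_device_list(devs, 1); // also free the list on the error path
    return CONTROL_ERROR;
}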
It turns out this was a kind of driver issue. I had to tell Windows to associate the particular device I'm using with the libusb drivers.
libusb-win32-1.2.6.0 comes with some tools to make that association (although you may need to configure your system to allow the installation of unsigned drivers).
There's one tricky bit. If you just want to associate the device with libusb, you can use the inf-wizard.exe tool to make that association, but that will change the primary association to be with libusb. In my case, the device is a USB Audio Class device (i.e. USB sound card) that also has some libusb functionality. When I used inf-wizard.exe, libusb started working (yay!), but then it stopped working as an audio device.
In my case, I needed to use the install-filter-win.exe tool to install a filter driver for libusb. That allows the device to still show up as a USB Audio device, but also interact with it using libusb.

C on embedded system w/ linux kernel - mysterious adc read issue

I'm developing on an Analog Devices Blackfin BF537 DSP running uClinux, with a total of 32 MB of SDRAM available. I have an ADC attached, which I can access using a simple, blocking call to read().
The most interesting part of my code is below. Running the program seems to work just fine: I get a nice data package that I can fetch from the SD card and plot. However, if I comment out the float calculation part (as noted in the code), I get only zeroes in the ft_all.raw file. The same occurs if I change the optimization level from -O3 to -O0.
I've tried countless combinations of all sorts of things, and sometimes it works, sometimes it does not. Earlier (with minor modifications to the code below), it would only work when optimization was disabled. It may also break if I add something else further down in the file.
My suspicion is that the data transferred by the read() call may not have been transferred fully (is that possible, even though it returns the correct number of bytes?). This is also the first time I initialize pointers using direct memory addresses, and I have no idea how the compiler reacts to that; perhaps I missed something here?
I've spent days on this issue now, and I'm getting desperate. I would really appreciate some help on this one! Thanks in advance.
// Clear the top 16M of memory for data processing
memset((int *)0x01000000, 0x0000, (size_t)SIZE_16M);

/* Prep some pointers for data processing */
int16_t *buffer;
int16_t *buf16I, *buf16Q;
buffer = (int16_t *)(0x1000000);
buf16I = (int16_t *)(0x1600000);
buf16Q = (int16_t *)(0x1680000);

/* Read data from ADC */
int rbytes = read(Sportfd, (int16_t *)buffer, 0x200000);
if (rbytes != 0x200000) {
    printf("could not sample data! %X\n", rbytes);
    goto end;
} else {
    printf("Read %X bytes\n", rbytes);
}

FILE *outfd;
int wbytes;

/* Commenting this region results in all zeroes in ft_all.raw */
float a, b;
int c;
b = 0;
for (c = 0; c < 1000; c++) {
    a = c;
    b = b + pow(a, 3);
}
printf("b is %.2f\n", b);

/* Only the 12 LSBs of each 32-bit word are actual data.
 * First 20 bits of nothing, then 12 bits I, then 20 bits
 * of nothing, then 12 bits Q, etc...
 * Below, the I and Q parts are scaled by a factor of 16
 * and extracted to buf16I and buf16Q.
 */
int32_t *buf32;
buf32 = (int32_t *)buffer;
uint32_t i = 0;
uint32_t n = 0;
while (n < 0x80000) {
    buf16I[i] = buf32[n] << 4;
    n++;
    buf16Q[i] = buf32[n] << 4;
    i++;
    n++;
}

printf("Saving to /mnt/sd/d/ft_all.raw...");
outfd = fopen("/mnt/sd/d/ft_all.raw", "w+");
if (outfd == NULL) {
    printf("Could not open file.\n");
}
wbytes = fwrite((int *)0x1600000, 1, 0x100000, outfd);
fclose(outfd);
if (wbytes < 0x100000) {
    printf("wbytes not correct (= %d) \n", (int)wbytes);
}
printf(" done.\n");
Edit: The code seems to work perfectly well if I use read() to read data from a simple file rather than the ADC. This leads me to believe that the rather hacky-looking code when extracting the I and Q parts of the input is working as intended. Inspecting the assembly generated by the compiler confirms this.
I'm trying to get in touch with the developer of the ADC driver to see if he has an explanation of this behaviour.
The ADC is connected through a SPORT, and is opened as such:
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
And here are the options used when configuring the SPORT:
spconf->int_clk = 1;
spconf->word_len = 32;
spconf->serial_clk = SPORT_CLK;
spconf->fsync_clk = SPORT_CLK/34;
spconf->fsync = 1;
spconf->late_fsync = 1;
spconf->act_low = 1;
spconf->dma_enabled = 1;
spconf->tckfe = 0;
spconf->rckfe = 1;
spconf->txse = 0;
spconf->rxse = 1;
A bfin_sport.h file from Analog Devices is also included: https://gist.github.com/tausen/5516954
Update
After a long night of debugging with the previous developer on the project, it turned out the issue was not related to the code shown above at all. As Chris suggested, it was indeed an issue with the SPORT driver and the ADC configuration.
While debugging, this error message appeared whenever the data was "broken": bfin_sport: sport ffc00900 status error: TUVF. While this doesn't make much sense in the application, it was clear from printing the data that something was out of sync: whenever the status error was shown, the data in buffer was of the form 0x12000000, 0x34000000, ... rather than 0x00000012, 0x00000034, ... It then seems clear why buf16I and buf16Q contained only zeroes (since I am extracting the 12 LSBs).
Putting in a few calls to usleep() between stages of ADC initialization and configuration seems to have fixed the issue - I'm hoping it stays that way!
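The fix itself is just a sprinkling of delays; a sketch of its shape (configure_adc() stands in for our ADC register setup and is not a real driver call, and the delay values are illustrative):

configure_adc();                          /* placeholder: program the ADC's own registers */
usleep(10000);                            /* let the ADC settle before touching the SPORT */
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
usleep(10000);                            /* settle again before the first read() */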

Issue with SPI (Serial Peripheral Interface), stuck on ioctl()

I'm trying to access a SPI sensor using the SPIDEV driver but my code gets stuck on IOCTL.
I'm running embedded Linux on the SAM9X5EK board (based on the AT91SAM9G25). The device is connected to SPI0. I enabled CONFIG_SPI_SPIDEV and CONFIG_SPI_ATMEL in menuconfig and added the proper code to the BSP file:
static struct spi_board_info spidev_board_info[] = {
    {
        .modalias = "spidev",
        .max_speed_hz = 1000000,
        .bus_num = 0,
        .chip_select = 0,
        .mode = SPI_MODE_3,
    },
    ...
};

spi_register_board_info(spidev_board_info, ARRAY_SIZE(spidev_board_info));
1 MHz is the maximum accepted by the sensor; I tried 500 kHz but got an error during Linux boot (too slow, apparently). .bus_num and .chip_select should be correct (I also tried all the other combinations). As for SPI_MODE_3, I checked the sensor's datasheet.
I get no error while booting and devices appear correctly as /dev/spidevX.X. I manage to open the file and obtain a valid file descriptor. I'm now trying to access the device with the following code (inspired by examples I found online).
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/spi/spidev.h>

#define MY_SPIDEV_DELAY_USECS 100
// #define MY_SPIDEV_SPEED_HZ 1000000
#define MY_SPIDEV_BITS_PER_WORD 8

int spidevReadRegister(int fd,
                       unsigned int num_out_bytes,
                       unsigned char *out_buffer,
                       unsigned int num_in_bytes,
                       unsigned char *in_buffer)
{
    struct spi_ioc_transfer mesg[2] = { {0}, };
    uint8_t num_tr = 0;
    int ret;

    // Write data
    mesg[0].tx_buf = (unsigned long)out_buffer;
    mesg[0].rx_buf = (unsigned long)NULL;
    mesg[0].len = num_out_bytes;
    // mesg[0].delay_usecs = MY_SPIDEV_DELAY_USECS;
    // mesg[0].speed_hz = MY_SPIDEV_SPEED_HZ;
    mesg[0].bits_per_word = MY_SPIDEV_BITS_PER_WORD;
    mesg[0].cs_change = 0;
    num_tr++;

    // Read data
    mesg[1].tx_buf = (unsigned long)NULL;
    mesg[1].rx_buf = (unsigned long)in_buffer;
    mesg[1].len = num_in_bytes;
    // mesg[1].delay_usecs = MY_SPIDEV_DELAY_USECS;
    // mesg[1].speed_hz = MY_SPIDEV_SPEED_HZ;
    mesg[1].bits_per_word = MY_SPIDEV_BITS_PER_WORD;
    mesg[1].cs_change = 1;
    num_tr++;

    // Do the actual transmission
    if (num_tr > 0)
    {
        ret = ioctl(fd, SPI_IOC_MESSAGE(num_tr), mesg);
        if (ret == -1)
        {
            printf("Error: %d\n", errno);
            return -1;
        }
    }
    return 0;
}
Then I'm using this function:
#define OPTICAL_SENSOR_ADDR "/dev/spidev0.0"
...
int fd;
fd = open(OPTICAL_SENSOR_ADDR, O_RDWR);
if (fd <= 0) {
    printf("Device not found\n");
    exit(1);
}
uint8_t buffer1[1] = {0x3a};
uint8_t buffer2[1] = {0};
spidevReadRegister(fd, 1, buffer1, 1, buffer2);
When I run it, the code gets stuck on the ioctl() call!
I did it this way because, in order to read a register on the sensor, I need to send a byte containing the register's address and then get the answer back without changing CS. (When I tried using the write() and read() functions instead, while learning, I got the same result: stuck on them.)
I'm aware that specifying .speed_hz causes an ENOPROTOOPT error on Atmel (I checked spidev.c), so I commented that part out.
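For what it's worth, a sketch of a possible workaround under that constraint: set the device-wide default speed once with the standard spidev SPI_IOC_WR_MAX_SPEED_HZ ioctl (using the fd from open() above) instead of a per-transfer speed_hz.

uint32_t speed = 1000000; /* 1 MHz, the sensor's maximum */
if (ioctl(fd, SPI_IOC_WR_MAX_SPEED_HZ, &speed) == -1)
    perror("SPI_IOC_WR_MAX_SPEED_HZ");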
Why does it get stuck? My thought is that the device node is created but doesn't actually "feel" any hardware. As I wasn't sure whether hardware SPI0 corresponds to bus_num 0 or 1, I tried both, but still no success (by the way, which one is it?).
UPDATE: I managed to get the SPI working! Half of it, anyway: MOSI is transmitting the right data, but CLK doesn't start... any idea?
When I'm working with SPI I always use an oscilloscope to look at the I/O lines. With a 4-channel scope you can easily debug the issue and find out whether you're accessing the right I/Os, using the right speed, and so on. I usually compare the signal I get against the timing diagram in the datasheet.
I think there are several issues here. First of all, SPI is bidirectional: if you want to send something over the bus, you also receive something. Therefore you should always provide valid buffers for both rx_buf and tx_buf.
Second, all members of the struct spi_ioc_transfer have to be initialized with a valid value; otherwise they just contain whatever happens to be at that memory address, the underlying driver ends up acting on arbitrary data, and you get unknown behavior.
Third, why do you use a for loop with ioctl? You already tell ioctl that you have an array of spi_ioc_transfer structs, so all the defined transfers will be performed with a single ioctl call.
Fourth, ioctl needs a pointer to your struct array, so the call should look like this:
ret = ioctl(fd, SPI_IOC_MESSAGE(num_tr), &mesg);
You see, there is room for improvement in your code.
This is how I do it in a C++ library for the Raspberry Pi. The whole library will soon be on GitHub; I'll update my answer when it is done.
void SPIBus::spiReadWrite(std::vector<std::vector<uint8_t> > &data, uint32_t speed,
                          uint16_t delay, uint8_t bitsPerWord, uint8_t cs_change)
{
    struct spi_ioc_transfer transfer[data.size()];
    int i = 0;
    for (std::vector<uint8_t> &d : data)
    {
        // see <linux/spi/spidev.h> for details!
        transfer[i].tx_buf = reinterpret_cast<__u64>(d.data());
        transfer[i].rx_buf = reinterpret_cast<__u64>(d.data());
        transfer[i].len = d.size(); // number of bytes in vector
        transfer[i].speed_hz = speed;
        transfer[i].delay_usecs = delay;
        transfer[i].bits_per_word = bitsPerWord;
        transfer[i].cs_change = cs_change;
        i++;
    }
    int status = ioctl(this->fileDescriptor, SPI_IOC_MESSAGE(data.size()), &transfer);
    if (status < 0)
    {
        std::string errMessage(strerror(errno));
        throw std::runtime_error("Failed to do full duplex read/write operation "
                                 "on SPI Bus " + this->deviceNode + ". Error message: " +
                                 errMessage);
    }
}

How can I access netstat-like Ethernet statistics from a Windows program

How can I access Ethernet statistics from C/C++ code like netstat -e?
Interface Statistics

                         Received        Sent
Bytes                    21010071        15425579
Unicast packets          95512           94166
Non-unicast packets      12510           7
Discards                 0               0
Errors                   0               3
Unknown protocols        0
WMI will provide those readings:
SELECT * FROM Win32_PerfFormattedData_Tcpip_IP
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCP
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDP
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMP
SELECT * FROM Win32_PerfFormattedData_Tcpip_Networkinterface
These classes are available on Windows XP or newer. On Windows 2000 you may have to fall back on the matching Win32_PerfRawData classes and do a little more math before you can display the output.
Documentation on all of them is in the MSDN.
A good place to start for network statistics is the GetIpStatistics call in the Windows IP Helper (IPHLPAPI) functions.
There are a couple of other approaches that are possibly more portable:
SNMP. Requires SNMP to be enabled on the computer, but can obviously also be used to retrieve statistics from remote computers.
Pipe the output of 'netstat' into your application and unpick the values from the text.
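A minimal sketch of the GetIpStatistics approach mentioned above (link against iphlpapi.lib; the field names are from the MIB_IPSTATS structure):

#include <winsock2.h>
#include <iphlpapi.h>
#include <stdio.h>

int main(void)
{
    MIB_IPSTATS stats;
    if (GetIpStatistics(&stats) != NO_ERROR) {
        fprintf(stderr, "GetIpStatistics failed\n");
        return 1;
    }
    // System-wide IP-layer counters; per-interface data needs GetIfTable.
    printf("Datagrams received: %lu\n", stats.dwInReceives);
    printf("Datagrams sent:     %lu\n", stats.dwOutRequests);
    return 0;
}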
Let me answer my own question, as I asked the same thing on another forum.
WMI is good, but it's easier to use the IP Helper API (IpHlpApi) instead:
#include <winsock2.h>
#include <iphlpapi.h>

int main(int argc, char *argv[])
{
    PMIB_IFTABLE pIfTable = NULL;
    MIB_IFROW ifRow;
    PMIB_IFROW pIfRow = &ifRow;
    DWORD dwSize = 0;

    // first call returns the buffer size needed
    DWORD retv = GetIfTable(pIfTable, &dwSize, TRUE);
    if (retv != ERROR_INSUFFICIENT_BUFFER)
        WriteErrorAndExit(retv);

    pIfTable = (MIB_IFTABLE *)malloc(dwSize);
    retv = GetIfTable(pIfTable, &dwSize, TRUE);
    if (retv != NO_ERROR)
        WriteErrorAndExit(retv);

    // Get index
    int i, j;
    printf("\tNum Entries: %ld\n\n", pIfTable->dwNumEntries);
    for (i = 0; i < (int)pIfTable->dwNumEntries; i++)
    {
        pIfRow = (MIB_IFROW *)&pIfTable->table[i];
        printf("\tIndex[%d]:\t %ld\n", i, pIfRow->dwIndex);
        printf("\tInterfaceName[%d]:\t %ws", i, pIfRow->wszName);
        printf("\n");
        printf("\tDescription[%d]:\t ", i);
        for (j = 0; j < (int)pIfRow->dwDescrLen; j++)
            printf("%c", pIfRow->bDescr[j]);
        printf("\n");
        ...
From http://en.wikipedia.org/wiki/Netstat:

On the Windows platform, netstat information can be retrieved by calling the GetTcpTable and GetUdpTable functions in the IP Helper API, or IPHLPAPI.DLL. Information returned includes local and remote IP addresses, local and remote ports, and (for GetTcpTable) TCP status codes. In addition to the command-line netstat.exe tool that ships with Windows, there are GUI-based netstat programs available.

On the Windows platform, this command is available only if the Internet Protocol (TCP/IP) protocol is installed as a component in the properties of a network adapter in Network Connections.
MFC sample at CodeProject: http://www.codeproject.com/KB/applications/wnetstat.aspx
You might find a feasible WMI performance counter, e.g. Win32_PerfRawData_Tcpip_NetworkInterface.
See Google Groups; the original netstat source code has been posted many times (Win32 API).
As the answers above suggest, the WMI performance counters contain some of this data. Just be aware that in later versions of Windows the counters are split into IPv4 and IPv6 variants, so the queries are:
SELECT * FROM Win32_PerfFormattedData_Tcpip_IPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMP
SELECT * FROM Win32_PerfFormattedData_Tcpip_IPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMPv6
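For a quick sanity check of these classes without writing any code, a wmic one-liner works; the property names below are my guess at the IPv4 datagram counters, so verify them on your system:

C:\> wmic path Win32_PerfFormattedData_Tcpip_IPv4 get DatagramsReceivedPersec,DatagramsSentPersec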
