How can I access netstat-like Ethernet statistics from a Windows program - c

How can I access Ethernet statistics from C/C++ code like netstat -e?
Interface Statistics
Received Sent
Bytes 21010071 15425579
Unicast packets 95512 94166
Non-unicast packets 12510 7
Discards 0 0
Errors 0 3
Unknown protocols 0

The WMI will provide those readings:
SELECT * FROM Win32_PerfFormattedData_Tcpip_IP
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCP
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDP
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMP
SELECT * FROM Win32_PerfFormattedData_Tcpip_Networkinterface
These classes are available on Windows XP or newer. You may have to resign to the matching "Win32_PerfRawData" classes on Windows 2000, and do a little bit more of math before you can display the output.
Find documentation on all of them in the MSDN.

A good place to start for network statistics would be the GetIpStatistics call in the Windows IPHelper functions.
There are a couple of other approaches that are possibly more portable:-
SNMP. Requires SNMP to be enabled on the computer, but can obviously be used to retrieve statistics for remote computers also.
Pipe the output of 'netstat' into your application, and unpick the values from the text.

Let me answer to myself, as I asked the same on another forum.
WMI is good, but it's easier to use IpHlpApi instead:
#include <winsock2.h>
#include <iphlpapi.h>
int main(int argc, char *argv[])
{
PMIB_IFTABLE pIfTable;
MIB_IFROW ifRow;
PMIB_IFROW pIfRow = &ifRow;
DWORD dwSize = 0;
// first call returns the buffer size needed
DWORD retv = GetIfTable(pIfTable, &dwSize, true);
if (retv != ERROR_INSUFFICIENT_BUFFER)
WriteErrorAndExit(retv);
pIfTable = (MIB_IFTABLE*)malloc(dwSize);
retv = GetIfTable(pIfTable, &dwSize, true);
if (retv != NO_ERROR)
WriteErrorAndExit(retv);
// Get index
int i,j;
printf("\tNum Entries: %ld\n\n", pIfTable->dwNumEntries);
for (i = 0; i < (int) pIfTable->dwNumEntries; i++)
{
pIfRow = (MIB_IFROW *) & pIfTable->table[i];
printf("\tIndex[%d]:\t %ld\n", i, pIfRow->dwIndex);
printf("\tInterfaceName[%d]:\t %ws", i, pIfRow->wszName);
printf("\n");
printf("\tDescription[%d]:\t ", i);
for (j = 0; j < (int) pIfRow->dwDescrLen; j++)
printf("%c", pIfRow->bDescr[j]);
printf("\n");
...

Szia,
from http://en.wikipedia.org/wiki/Netstat
On the Windows platform, netstat
information can be retrieved by
calling the GetTcpTable and
GetUdpTable functions in the IP Helper
API, or IPHLPAPI.DLL. Information
returned includes local and remote IP
addresses, local and remote ports, and
(for GetTcpTable) TCP status codes. In
addition to the command-line
netstat.exe tool that ships with
Windows, there are GUI-based netstat
programs available.
On the Windows platform, this command
is available only if the Internet
Protocol (TCP/IP) protocol is
installed as a component in the
properties of a network adapter in
Network Connections.
MFC sample at CodeProject: http://www.codeproject.com/KB/applications/wnetstat.aspx

You might find a feasable WMI performance counter, e.g. Win32_PerfRawData_Tcpip_NetworkInterface.

See Google Groups, original netstats source code has been posted many times (win32 api)

As above answers suggest, WMI performance counters contains some data. Just be aware that in later versions of windows the perf counters are broken down in v4 vs v6 so the queries are:
SELECT * FROM Win32_PerfFormattedData_Tcpip_IPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDPv4
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMP
SELECT * FROM Win32_PerfFormattedData_Tcpip_IPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_TCPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_UDPv6
SELECT * FROM Win32_PerfFormattedData_Tcpip_ICMPv6

Related

RoCE connection problem with MLNX_OFED (RDMA over Converged Ethernet)

I am trying to get RoCe (RDMA over converged ethernet) to work on two workstations. I have installed MLNX_OFED on both computers which are equiped with Mellanox ConnectX-5 EN 100GbE adapters and connected directly to each other via corresponding cable. According to what I've read I need the Subnet Manager to be running on one of the workstations for it to be able to use RoCe between them.
When trying to run the command opensm it says it finds no local ports. I can ping both computers and it works. I can also use the command udaddy to test the RDMA and it also works. But trying to run the RDMA_RC_EXAMPLE presented in this guide https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf it fails when trying to create the Queue Pairs, more specificly when it tries to change state to RTR (ready to receive).
Also some sources says that you need a RDMA service, which does not exists on my computer. And the installation of MLNX_OFED puts an exclude of ibutils-libs* in /etc/yum.conf, which I don't know if it's relevant but I noticed it.
I am running CentOS 7.7 on one of the machines and CentOS 7.8 on the other.
I'm a bit puzzled over what's faulty.
Update
Here's the function that breaks when running the code.
/******************************************************************************
* Function: modify_qp_to_rtr
*
* Input
* qp QP to transition
* remote_qpn remote QP number
* dlid destination LID
* dgid destination GID (mandatory for RoCEE)
*
* Output
* none
*
* Returns
* 0 on success, ibv_modify_qp failure code on failure
*
* Description
* Transition a QP from the INIT to RTR state, using the specified QP number
******************************************************************************/
static int modify_qp_to_rtr (struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t * dgid)
{
struct ibv_qp_attr attr;
int flags;
int rc;
memset (&attr, 0, sizeof (attr));
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = IBV_MTU_256;
attr.dest_qp_num = remote_qpn;
attr.rq_psn = 0;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer = 0x12;
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = dlid;
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num = config.ib_port;
if (config.gid_idx >= 0)
{
attr.ah_attr.is_global = 1;
attr.ah_attr.port_num = 1;
memcpy (&attr.ah_attr.grh.dgid, dgid, 16);
attr.ah_attr.grh.flow_label = 0;
attr.ah_attr.grh.hop_limit = 1;
attr.ah_attr.grh.sgid_index = config.gid_idx;
attr.ah_attr.grh.traffic_class = 0;
}
flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
rc = ibv_modify_qp (qp, &attr, flags);
if (rc)
fprintf (stderr, "failed to modify QP state to RTR\n");
return rc;
}
This source RoCE Debug flow Linux says that
RoCE requires an MTU of at least 1024 bytes for net payload.
Which I guess would affect this row in the code:
attr.path_mtu = IBV_MTU_256;
When changing to IBV_MTU_1024 it compiles but gives the same error.
Solved
When using RoCE you need to specify the GID index as explained by #haggai_e. I don't think it's needed when using Infiniband instead of RoCE.
When running the qperf RDMA test we also needed to set the connection manager flag for it to run. I would assume it has the same function as the Subnet Manager that needs to be used when using Infiniband instead of RoCE.

Cannot open USB device with libusb-1.0 in cygwin

I'm trying to interface with a USB peripheral using libusb-1.0 in cygwin.
libusb_get_device_list(...) works fine, I get a list of USB devices. It finds the device with the correct VendorID and ProductID in the device list, but when libusb_open(...) is called with that device, it always fails with the error code LIBUSB_ERROR_NOT_FOUND.
I don't think it's a permission issue, I've tried running this as admin, and there's a separate error code (LIBUSB_ERROR_ACCESS) for that. This same code works with libusb-1.0 in Linux.
unsigned init_usb(int vendor_id, int product_id, int interface_num)
{
int ret = libusb_init(NULL);
if (ret < 0) return CONTROL_ERROR;
libusb_device **devs = NULL;
int num_dev = libusb_get_device_list(NULL, &devs);
libusb_device *dev = NULL;
for (int i = 0; i < num_dev; i++) {
struct libusb_device_descriptor desc;
libusb_get_device_descriptor(devs[i], &desc);
if (desc.idVendor == vendor_id && desc.idProduct == product_id) {
dev = devs[i];
break;
}
}
if (dev == NULL) return CONTROL_ERROR;
libusb_device_handle *devh = NULL;
ret = libusb_open(dev, &devh);
//ret is always -5 here (in cygwin)!
if (ret < 0) return CONTROL_ERROR;
libusb_free_device_list(devs, 1);
return CONTROL_SUCCESS;
}
It turns out this was a kind of driver issue. I had to tell Windows to associate the particular device I'm using with the libusb drivers.
libusb-win32-1.2.6.0 comes with some tools to make that association (although you may need to configure your system to allow the installation of unsigned drivers).
There's one tricky bit. If you just want to associate the device with libusb, you can use the inf-wizard.exe tool to make that association, but that will change the primary association to be with libusb. In my case, the device is a USB Audio Class device (i.e. USB sound card) that also has some libusb functionality. When I used inf-wizard.exe, libusb started working (yay!), but then it stopped working as an audio device.
In my case, I needed to use the install-filter-win.exe tool to install a filter driver for libusb. That allows the device to still show up as a USB Audio device, but also interact with it using libusb.

Getting raw data using libusb

I'm doing reverse engineering about a ultrasound probe on the Linux side. I want to capture raw data from an ultrasound probe. I'm programming with C and using the libusb API.
There are two BULK IN endpoints in the device (2 and 6). The device is sending 2048 bytes data, but it is sending data as 512 bytes with four block.
This picture is data flow on the Windows side, and I want to copy that to the Linux side. You see four data blocks with endpoint 02 and after that four data blocks with endpoint 06.
But there is a problem about timing. The first data block of endpoint 02's and first data block of endpoint 06's are close to each other acoording to time. But in data flow they are not in sequence.
I see that the computer is reading the first data blocks of endpoint 02 and 06. After that, the computer is reading the other three data blocks of endpoint 02 and endpoint 06. But in USB Analyzer, the data flow is being viewed according to the endpoint number. The sequence is different according to time.
On the Linux side, I write code like this:
int index = 0;
imageBuffer2 = (unsigned char *) malloc(2048);
imageBuffer6 = (unsigned char *) malloc(2048);
while (1) {
libusb_bulk_transfer(devh, BULK_EP_2, imageBuffer2, 2048, &actual2, 0);
libusb_bulk_transfer(devh, BULK_EP_6, imageBuffer6, 2048, &actual6, 0);
//Delay
for(index = 0; index <= 10000000; index ++)
{
}
}
So that result is in picture as below
In other words, in my code all reading data is being read in sequence according to time and endpoint number. My result is different from the data flow on the Windows side.
In brief, I have two BULK IN endpoints, and they are starting read data close according to time. How is it possible?
It's not clear to me whether you're using a different method for getting the data on Windows or not, I'm going to assume that you are.
I'm not an expert on libusb by any means, but my guess would be that you are overwriting you data with each call, since you're using the same buffer each time. Try giving your buffer a fixed value before using the transfer method, and then evaluate the result.
If it is the case, I believe something along the lines of the following would also work in C:
imageBuffer2 = (unsigned char *) malloc(2048);
char *imageBuffer2P = imageBuffer2;
imageBuffer6 = (unsigned char *) malloc(2048);
char *imageBuffer6P = imageBuffer6;
int dataRead2 = 0;
int dataRead6 = 0;
while(dataRead2 < 2048 || dataRead6 < 2048)
{
int actual2 = 0;
int actual6 = 0;
libusb_bulk_transfer(devh, BULK_EP_2, imageBuffer2P, 2048-dataRead2, &actual2, 200);
libusb_bulk_transfer(devh, BULK_EP_6, imageBuffer6P, 2048-dataRead6, &actual6, 200);
dataRead2 += actual2;
dataRead6 += actual6;
imageBuffer2P += actual2;
imageBuffer6P += actual6;
usleep(1);
}

List available capture formats

New to V4L, I decided to start using the video4linux2 library in order to capture a frame from my camera in C (I am using the uvcvideo module with a Ricoh Co. camera). I followed several guides and tutorials, and managed to get a running program. My question is mainly about this usual code line :
struct v4l2_format format = {0};
format.fmt.pix.pixelformat = V4L2_PIX_FMT_MJPEG;
// ...
This is where I set the actual video format to use when capturing. As you can see, in this sample, I'm using MJPEG (http://lxr.free-electrons.com/source/include/uapi/linux/videodev2.h#L390). Even though this might be a great format and all, my application will probably require simple RGB formatting, pixel per pixel I guess. For this reason, I tried using RGB format constants such as V4L2_PIX_FMT_RGB24. Now for some reason... v4l2 doesn't like it. I'm guessing this is hardware-related, but I'd like to avoid MJPEG manipulations as much as possible. For testing purposes, I tried using other constants and formats, but no matter what I did, v4l2 kept changing the pixelformat field's value :
xioctl(fd, VIDIOC_S_FMT, &format); // This call succeeds with errno != EINTR.
if(format.fmt.pix.pixelformat != V4L2_PIX_FMT_RGB24){
// My program always enters this block when not using MJPEG.
fprintf(stderr, "Format wasn't accepted by v4l2.");
exit(4);
}
Now my question is : is there a way I could get a list of accepted video formats (and I mean, accepted by my camera/v4l2), from which I could pick something else than MJPEG ? If you think I have to stick with MJPEG, would you recommend me any library allowing me to manipulate it, and eventually, to draw back the capture in a GUI frame ?
Barbarian test code
I used the following trick to test all available formats on my hardware. First, some shell script to get a list of all formats...
grep 'V4L2_PIX_FMT' /usr/include/linux/videodev2.h | grep define | tr '\t' ' ' | cut -d' ' -f2 | sed 's/$/,/g'
... the output of which is used in this C program :
int formats[] = {/* result of above command */};
int i = 0;
struct v4l2_format format = {0};
for(i = 0; i < /* number of lines in previous command output */; i++){
format.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
format.fmt.pix.width = 320;
format.fmt.pix.height = 240;
format.fmt.pix.pixelformat = formats[i];
format.fmt.pix.field = V4L2_FIELD_NONE;
if(xioctl(fd, VIDIOC_S_FMT, &format) != -1 && format.fmt.pix.pixelformat == formats[i])
fprintf(stderr, "Accepted : %d\n", i);
}
This test reveals that only V4L2_PIX_FMT_YUYV and V4L2_PIX_FMT_MJPEG are functional. Any way I could improve this, or is it hardware-related ?
In Linux, command line utility v4l2-ctl displays all of a webcam's natively supported formats -- install it with sudo apt-get install v4l-utils, run it with v4l2-ctl -dX --list-formats-ext where X is the camera index as in /dev/videoX. These formats are reported to the v4l2 kernel module via uvcvideo module and they are supported natively by the webcam chipset. Only the listed formats are supported by v4l2, anything else would need to be coded by the user, and RGBs are very seldom provided, despite virtually all CCDs working in Bayer RGGB. The most common formats by far are YUV422 (YUYV or YUY2) and MJPEG, with a certain overlap: MJPEG achieve larger frame rates for large resolutions.
C++ code for listing the camera formats can be found in Chromium GetDeviceSupportedFormats() implementation for Linux here.
If you have to plug code to convert YUV to RGB I'd recommend libyuv which has been optimized for plenty of architectures.
In order to enumerate the available format you can use ioctl VIDIOC_ENUM_FMT
To print descriptions of capture format supported by /dev/video0 you can process like this :
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <linux/videodev2.h>
int main()
{
int fd = v4l2_open("/dev/video0", O_RDWR);
if (fd != -1)
{
struct v4l2_fmtdesc fmtdesc;
memset(&fmtdesc,0,sizeof(fmtdesc));
fmtdesc.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
while (ioctl(fd,VIDIOC_ENUM_FMT,&fmtdesc) == 0)
{
printf("%s\n", fmtdesc.description);
fmtdesc.index++;
}
v4l2_close(fd);
}
}

CUDA kernel launch parameters explained right?

Here I tried to self-explain the CUDA launch parameters model (or execution configuration model) using some pseudo codes, but I don't know if there were some big mistakes, So hope someone help to review it, and give me some advice. Thanks advanced.
Here it is:
/*
normally, we write kernel function like this.
note, __global__ means this function will be called from host codes,
and executed on device. and a __global__ function could only return void.
if there's any parameter passed into __global__ function, it should be stored
in shared memory on device. so, kernel function is so different from the *normal*
C/C++ functions. if I was the CUDA authore, I should make the kernel function more
different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
int idx = blockIdx.x * blockDIm.x + threadIdx.x;
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
/*
after this definition, we could call this kernel function in our normal C/C++ codes !!
do you feel something wired ? un-consistant ?
normally, when I write C codes, I will think a lot about the execution process down to
the metal in my mind, and this one...it's like some fragile codes. break the sequential
thinking process in my mind.
in order to make things normal, I found a way to explain: I expand the *__global__ * function
to some pseudo codes:
*/
#define __foreach(var, start, end) for (var = start, var < end; ++var)
__device__ int
__indexing() {
const int blockId = blockIdx.x * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
return
blockId * (blockDim.x * blockDim.y * blockDim.z) +
threadIdx.z * (blockDim.x * blockDim.y) +
threadIdx.x;
}
global_config =:
{
/*
global configuration.
note the default values are all 1, so in the kernel codes,
we could just ignore those dimensions.
*/
gridDim.x = gridDim.y = gridDim.z = 1;
blockDim.x = blockDim.y = blockDim.z = 1;
};
kernel =:
{
/*
I thought CUDA did some bad evil-detail-covering things here.
it's said that CUDA C is an extension of C, but in my mind,
CUDA C is more like C++, and the *<<<>>>* part is too tricky.
for example:
kernel<<<10, 32>>>(); means kernel will execute in 10 blocks each have 32 threads.
dim3 dimG(10, 1, 1);
dim3 dimB(32, 1, 1);
kernel<<<dimG, dimB>>>(); this is exactly the same thing with above.
it's not C style, and C++ style ? at first, I thought this could be done by
C++'s constructor stuff, but I checked structure *dim3*, there's no proper
constructor for this. this just brroke the semantics of both C and C++. I thought
force user to use *kernel<<<dim3, dim3>>>* would be better. So I'd like to keep
this rule in my future codes.
*/
gridDim = dimG;
blockDim = dimB;
__foreach(blockIdx.z, 0, gridDim.z)
__foreach(blockIdx.y, 0, gridDim.y)
__foreach(blockIdx.x, 0, gridDim.x)
__foreach(threadIdx.z, 0, blockDim.z)
__foreach(threadIdx.y, 0, blockDim.y)
__foreach(threadIdx.x, 0, blockDim.x)
{
const int idx = __indexing();
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
};
/*
so, for me, gridDim & blockDim is like some boundaries.
e.g. gridDim.x is the upper bound of blockIdx.x, this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
unsigned int x, y, z;
#if defined(__cplusplus)
__host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
__host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;
CUDA DRIVER API
The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:
cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel
The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.
cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid
Additional information on these functions can be found in cuda.h.
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void **kernelParams,
void **extra);
cuLaunchKernel takes as parameters the entire launch configuration.
See NVIDIA Driver API[Execution Control]1 for more details.
CUDA KERNEL LAUNCH
cuLaunchKernel will
1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
4. push the launch parameters into the command buffer
5. push the launch command into the command buffer
6. submit the command buffer to the device (on wddm drivers this step may be deferred)
7. on wddm the kernel driver will page all memory required in device memory
The GPU will
1. verify the command
2. send the commands to the compute work distributor
3. dispatch launch configuration and thread blocks to the SMs
When all thread blocks have completed the work distributor will flush the caches to honor the CUDA memory model and it will mark the kernel as completed so the next item in the stream can make forward progress.
The order that thread blocks are dispatched differs between architectures.
Compute capability 1.x devices store the kernel parameters in shared memory.
Compute capability 2.0-3.5 devices store the kenrel parameters in constant memory.
CUDA RUNTIME API
The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:
cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument
See NVIDIA Runtime API[Execution Control]2
The <<<>>> CUDA language extension is the most common method used to launch a kernel.
During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.
For example
__global__ void kernel(float* buf, int j)
{
// ...
}
kernel<<<blocks,threads,0,myStream>>>(d_buf,j);
generates
void __device_stub__Z6kernelPfi(float *__par0, int __par1){__cudaSetupArgSimple(__par0, 0U);__cudaSetupArgSimple(__par1, 4U);__cudaLaunch(((char *)((void ( *)(float *, int))kernel)));}
You can inspect the generated files by adding --keep to your nvcc command line.
cudaLaunch calls cuLaunchKernel.
CUDA DYNAMIC PARALLELISM
CUDA CDP works similar to the CUDA Runtime API described above.
By using <<<...>>>, you are launching a number of threads in the GPU. These threads are grouped into blocks and forms a large grid. All the threads will execute the invoked kernel function code.
In the kernel function, build-in variables like threadIdx and blockIdx enable the code know which thread it runs and do the scheduled part of the work.
edit
Basically, <<<...>>> simplifies the configuration procedure to launch a kernel. Without using it, one may have to call 4~5 APIs for a single kernel launch, just as the OpenCL way, which use only C99 syntax.
In fact you could check CUDA driver APIs. It may provide all those APIs so you don't need to use <<<>>>.
Basically, the GPU is divided into separate "device" GPUs (e.g. GeForce 690 has 2) -> multiple SM's (streaming multiprocessors) -> multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical assignment irrelevant of hardware, but the total size of a block (x*y*z) is very important.
Threads in a block HAVE TO be on the same SM, to use its facilities of shared memory and synchronization. So you cannot have blocks with more threads than CUDA cores are contained in a SM.
If we have a simple scenario where we have 16 SMs with 32 CUDA cores each, and we have 31x1x1 block size, and 20x1x1 grid size, we will forfeit at least 1/32 of the processing power of the card. Every time a block is run, a SM will have only 31 of its 32 cores busy. Blocks will load to fill up the SMs, we will have 16 blocks finish at roughly the same time, and as the first 4 SMs free up, they will start processing the last 4 blocks (NOT necessarily blocks #17-20).
Comments and corrections are welcome.

Resources