web-gRPC Performance Rate per second - benchmarking

I want to develop a system for trace and debugging an external device via COM port.
Main service will be developed using python to receive, analyse & store logs data.
We decided to stream log data to web browser with gRPC protocol and draw live charts.
Highest rate if data is 50K of signals per second and maximum size of every signal is just 10 bytes.
System will be used in local network or same PC so we do not have bandwidth limits.
We want to make sure the web-grpc platform can cover this rate per second.
Thanks for your recommendations.

The throughput limit is mostly decided by the browser and the protobuf overhead. Since the latter is application specific, you should do a benchmark with real data on your preferred browsers.


How do I increase the speed of my USB cdc device?

I am upgrading the processor in an embedded system for work. This is all in C, with no OS. Part of that upgrade includes migrating the processor-PC communications interface from IEEE-488 to USB. I finally got the USB firmware written, and have been testing it. It was going great until I tried to push through lots of data only to discover my USB connection is slower than the old IEEE-488 connection. I have the USB device enumerating as a CDC device with a baud rate of 115200 bps, but it is clear that I am not even reaching that throughput, and I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong. I control every aspect of this from the front end on the PC to the firmware on the embedded system.
I am assuming my issue is how I write to the USB on the embedded system side. Right now my USB_Write function is run in free time, and is just a while loop that writes one char to the USB port until the write buffer is empty. Is there a more efficient way to do this?
One of my concerns that I have, is that in the old system we had a board in the system dedicated to communications. The CPU would just write data across a bus to this board, and it would handle communications, which means that the CPU didn't have to waste free time handling the actual communications, but could offload the communications to a "co processor" (not a CPU but functionally the same here). Even with this concern though I figured I should be getting faster speeds given that full speed USB is on the order of MB/s while IEEE-488 is on the order of kB/s.
In short is this more likely a fundamental system constraint or a software optimization issue?
I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong.
You are correct, the baud number is a dummy value. If you create a CDC/RS232 adapter you would use this to configure your RS232 hardware, in this case it means nothing.
Is there a more efficient way to do this?
Absolutely! You should be writing chunks of data the same size as your USB endpoint for maximum transfer speed. Depending on the device you are using your stream of single byte writes may be gathered into a single packet before sending but from my experience (and your results) this is unlikely.
Depending on your latency requirements you can stick in a circular buffer and only issue data from it to the USB_Write function when you have ENDPOINT_SZ number of byes. If this results in excessive latency or your interface is not always communicating you may want to implement Nagles algorithm.
One of my concerns that I have, is that in the old system we had a board in the system dedicated to communications.
The NXP part you mentioned in the comments is without a doubt fast enough to saturate a USB full speed connection.
In short is this more likely a fundamental system constraint or a software optimization issue?
I would consider this a software design issue rather than an optimisation one, but no, it is unlikely you are fundamentally stuck.
Do take care to figure out exactly what sort of USB connection you are using though, if you are using USB 1.1 you will be limited to 64KB/s, USB 2.0 full speed you will be limited to 512KB/s. If you require higher throughput you should migrate to using a separate bulk endpoint for the data transfer.
I would recommend reading through the USB made simple site to get a good overview of the various USB speeds and their capabilities.
One final issue, vendor CDC libraries are not always the best and implementations of the CDC standard can vary. You can theoretically get more data through a CDC endpoint by using larger endpoints, I have seen this bring host side drivers to their knees though - if you go this route create a custom driver using bulk endpoints.
Try testing your device on multiple systems, you may find you get quite different results between windows and linux. This will help to point the finger at the host end.
And finally, make sure you are doing big buffered reads on the host side, USB will stop transferring data once the host side buffers are full.

Effect of network transfer over cpu

I have a project for which i need to minimize the impact of sending / receiving computed data over the network.
In my configuration, a (small) grid of computers will compute a large number of values (matrix_i^n, each machine having a large set of i's assigned). These values will then be send over the network to another computer depending on properties of the computed value (on average, every computer receives the same number of values).
I would like to optimize the time needed to compute these values (up to a power m, predetermined). In oder to do this, i need to choose the best way to transfer the intermediary results:
Precompute everything then exchange all the values to the right computer
Send every value to the right computer as soon as it is available
Hybrid solution where small packs of data are exchanged during the computation
Since network transfers are very slow, I have the feeling that I should start transferring data asap but i'm not sure that the overhead on the CPU (handling more exceptions, hence more work for the scheduler) would not blow the performance of the computation.
Do you know documentation i could rely on or a good benchmark suite (written in C) i could use to make some test by myself ?
Thank you
CPU usage for networking is generally determined by the amount of I/O calls you make. If you can, design your app in a way that lets you tweak your buffer size easily so that you can test. Using enough buffering, 10gb/s is no sweat.
What OS are you using? I doubt you'll need it, but Windows 8 has Registered I/O which is designed for extremely low latency and CPU usage.

How to get WIFI parameters (bandwidth, delay) on Ubuntu using C

I am student and I am writting simple application in C99 standard. Program should working on Ubuntu.
I have one problem - I don't know how can I get some Wifi parameters like bandwidth or delay. I haven't any idea how to do this. It is possible to do this using standard functions or any linux API (ech I am windows user)?.
In general, you don't know the bandwidth or delay of a wifi device.
Bandwidth and delay is the type of information from a link.
As far as I know, there is no such information holding in WiFi drivers.
The most link-related information is SINR.
For measuring bandwidth or delay, you should write your own code.
Maybe you should tell us more about your concrete problem. For now, I assume that you are interested in the throughput and latency of a specific wireless link, i.e. a link between two 802.11 stations. This could be a link between an access point and a client or between two ad-hoc stations.
The short answer is that there is no such API. In fact, it is not trivial even to estimate these two link parameters. They depend on the signal quality, on the data rate used by the sending station, on the interference, on the channel utilization, on the load of the computer systems at both ends, and probably a lot of other factors.
Depending on the wireless driver you are using it may be possible to obtain information about the currently used data rate and some packet loss statistics for the station you are communicating with. Have a look at net/mac80211/sta_info.h in your Linux kernel source tree. If you are using MadWifi, you may find useful information in the files below /proc/net/madwifi/ath0/ and in the output of wlanconfig ath0 list sta.
However, all you can do is to make a prediction. If the link quality changes suddenly, your prediction may be entirely wrong.

multiple or single requests per udp packet?

I am developing my own protocol over UDP (under Linux) for a cache application (similar to memcached) which only executes INSERT/READ/UPDATE/DELETE operations on an object and I am not sure which design would be the best:
Send one request per packet. (client prepares the request and sends it to the server immediately)
Send multiple requests per packet. (client enqueues the requests in a packet and when it is full (close to the MTU size) sends it to the server)
The size of the request (i.e. the record data) can be from 32 bytes to 1400 bytes, I don't know which will it be on average, it entirely depends on the user's application.
If choose single request per packet, I will have to manage a lot of small packets and the kernel will be interruped a lot of times. This will slow the operation since the kernel must save registers when switching from user space to system. Also there will be overhead in data transmition, if user's application sends many requests of 32 bytes (the packet overhead for udp is about 28 bytes) network traffic will double and I will have big impact on transmission speed. However high network traffic not necessarily implies low performance since the NIC has its own processor and does not makes the CPU stall. Additional network card can be installed in case of a network bottleneck.
The big advantage for using single packet is that the server and client will be so simple that I will save on instructions and gain on speed, at the same time I will have less bugs and the project will be finished earlier.
If I use multiple requests per packet, I will have fewer but bigger packets and therefore more data could be transmitted over the network. I will have reduced number of system calls but the complexity of the server will require more memory and more instructions to be executed so it is unknown if we get faster execution doing it this way. It may happen that the CPU will be the bottleneck, but what is cheaper, to add a CPU or a network card?
The application should have heavy data load, like 100,000 requests per second on lastest CPUs. I am not sure which way to do it. I am thinking to go for 'single request per packet', but before I rewrite all the code I already wrote for multiple request handling I would like to ask for recommendations.
Thanks in advance.
What do you care about more: latency or bandwidth?
If latency, send the request as soon as possible even if that means a lot of "slack" at the ends of packets and more packets overall.
If bandwidth, bundle multiple requests to eliminate the "slack" and send fewer packets overall.
NOTE: The network, not the CPU, will likely be your major bottleneck in either case, unless you are running over an extremely fast network. And even if you do, the INSERT/READ/UPDATE/DELETE in the database will likely spend more CPU and I/O than the CPU work needed for packets.
Another trade off to sending multiple requests per packet is that
On the one hand, the unreliable nature of UDP may cause you to drop multiple requests at at time, thus making re-transmissions more expensive.
On the other hand, the kernel will be using fewer buffers to deliver your data, reducing the chances of data drops
However, the analysis is incomplete without an understanding of the deployment architecture, such as the buffer sizes of the NICs, switches, and routers, and other networking hardware.
But the recommendation is to start with a relatively simple implementation (single request per packet), but write the code in such a way so that it will not be too difficult to add more complexity if needed.

Maintaining connections with many real-time devices

I'm writing a program on Linux to control about 1000 Patient Monitors at same time over UDP sockets. I've successfully written a library to parse and send messages to collect the data from a single patient monitor device. There are various scheduling constraints on the the device, listed below:-
Each device must constantly get an alive-request from computer client within max time-period of 300 milliseconds(may differ for different devices), otherwise connection is lost.
Computer client must send a poll-request to a device in order fetch the data within some time period. I'm polling for about 5 seconds of averaged data from patient monitor, therefore, I'm required to send poll-request in every 5 * 3 = 15 seconds. If I fail to send the request within 15 seconds time-frame, I looses the connection from device.
Now, I'm trying to extend my current program so that it is capable of handling about 1000+ devices at same time. Right now, my program can efficiently handle and parse response from just one device. In case of handling multiple devices, it is necessary to synchronize multiple responses from different device and serialize them and stream it over TCP socket, so that remote computers can also analyze the data. Well, that is not a problem because it is a well know multiple-producer and single consumer problem. My main concern is, what approach should I use in order to maintain alive-connection 1000+ devices.
After reading over Internet and browsing for similar questions on this website, I'm mainly considering two options:-
Use one thread per device. In order to control 1000+ device, I would end up in making 1000+ threads which does not look feasible to me.
Use multiplexing approach, selecting FD that requires attention and deal with it one at a time. I'm not sure how would I go about it and if multiplexing approach would be able to maintain alive-connection with all the devices considering above two constants.
I need some suggestions and advice on how to deal with this situation where you need to control 1000+ real-time-device over UDP sockets. Each device requires some alive-signal every 300 milliseconds (differ for different devices) and they require poll request in about 3 times the time interval mentioned during association phase. For example, patient monitors in ICU may require real-time (1 second averaged) data where as patient monitors in general wards may require 10-seconds averaged data, therefore, poll period for two devices would be 3*1(3 seconds) and 3*10 (30 seconds) respectively.
Shivam Kalra
for the most part either approach is at least functionally capable of handling the functionality you describe, but by the sounds of things performance will be a crucial issue. From the figures you have provided it seems that the application could be CPU-buond.
A multithreaded approach has the advantage of using all of the available CPU cores on the machine, but multithreaded programs are notorious for being difficult to make reliable and robust.
You could also use the Apache's old tried-and-true forked-worker model - create, say, a separate process to handle a maximum of 100 devices. You could then need to write code to manage the mapping of connections to processes.
You could also use multiple hosts and some mechanism to distribute devices among them. This would have the advantage of making it easier to handle recovery situations. It sounds like your application could well be mission critical, and it may need to be architected so that if any one piece of hardware breaks then other hardware will take over automatically.