How expensive is opening a HANDLE? - c

I am currently writing a time-sensitive application, and it got me thinking: How expensive is opening/closing a handle (in my case a COM port) compared to reading/writing from the handle?
I know the relative cost of other operations (like dynamic allocation vs. stack allocation), but I haven't found anything in my travels about this.

There isn't a unique answer, especially in the case of devices. In general, the "open" operation (CreateFile) involves more work in the device driver. Drivers tend to do as much work as possible at initialization/open time in order to optimize subsequent read/write operations. Moreover, many devices may require a long setup. E.g. the "classic" serial driver takes a noticeable amount of time to program the baud-rate prescaler and the handshake signals. Once the device is open and ready, though, the read and write operations are usually quite fast. But this is just a hint; it depends on the particular driver you are using (traditional COM port? USB converter? The drivers are very different). I recommend investigating with a profiler.
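If you want hard numbers for your own setup, it is easy to time both calls directly. A minimal sketch for Windows using QueryPerformanceCounter; the port name "\\.\COM1" and the default comm settings are placeholders for your actual device, and a real measurement should configure COMMTIMEOUTS/SetCommState and average over many runs:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        LARGE_INTEGER freq, t0, t1, t2;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        HANDLE h = CreateFileA("\\\\.\\COM1", GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, 0, NULL);
        QueryPerformanceCounter(&t1);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        char buf[64];
        DWORD nread = 0;
        /* A single read; with default timeouts this may block until data
           arrives, so set COMMTIMEOUTS before timing real traffic. */
        ReadFile(h, buf, sizeof buf, &nread, NULL);
        QueryPerformanceCounter(&t2);

        printf("open: %.1f us, read: %.1f us\n",
               (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart,
               (t2.QuadPart - t1.QuadPart) * 1e6 / freq.QuadPart);

        CloseHandle(h);
        return 0;
    }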

Related

PCIe bus latency when using ioctl vs read?

I've got a hardware client1 whose line of data acquisition cards I've written a Linux PCI kernel driver for.
The card can only communicate 1-4 bytes at a time, depending on how the user configures it. Given this, I use ioctl for some of the functionality, but I also fill in the file_operations structure so the card can be treated as a basic character device, giving the user the option of plain read or write calls if they just want simple 1-byte communication with the card.
After discussing the driver with the client, one of their developers is under the impression that treating the card as a character device via open/read/write will introduce latency on the PCI bus, versus using open/ioctl.
Given that the driver makes no distinction how it's opened and the ioctl and read/write functions call the same code, is there any validity to this concern?
If so, how would I test the bus latency from my driver code? Are there kernel functions I can call to test this?
Lastly, wouldn't my test of the bus only be valid for my specific setup (kernel settings, platform, memory timing, CPU, etc.)?
1: they only have 2 other developers, neither of whom has ever used Linux
I suspect the client's developer is slightly confused. He's thinking that the distinction between using read or write versus ioctl corresponds to the type of operation performed on the bus. If you explain that this is just a software API difference, and that either option performs exactly the same operation on the bus, that should satisfy them.
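To make that concrete, both entry points can funnel into the same hardware-access routine inside the driver, so the bus sees identical traffic. A simplified sketch of that idea; card_xfer, struct card_dev and the ioctl number are hypothetical stand-ins, not the actual driver:

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/ioctl.h>
    #include <linux/types.h>
    #include <linux/uaccess.h>

    #define CARD_IOC_READ_BYTE _IOR('k', 1, __u8)    /* hypothetical ioctl */

    struct card_dev;                                  /* your per-card state */

    /* Hypothetical helper: the one place that actually touches the PCI bus.
       Implemented elsewhere in the driver. */
    static int card_xfer(struct card_dev *card, void *buf, size_t len, int dir);

    static ssize_t card_read(struct file *f, char __user *ubuf,
                             size_t len, loff_t *off)
    {
        struct card_dev *card = f->private_data;
        u8 byte;
        int ret = card_xfer(card, &byte, 1, READ);    /* same bus access... */
        if (ret)
            return ret;
        return copy_to_user(ubuf, &byte, 1) ? -EFAULT : 1;
    }

    static long card_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
    {
        struct card_dev *card = f->private_data;
        u8 byte;
        int ret;

        switch (cmd) {
        case CARD_IOC_READ_BYTE:
            ret = card_xfer(card, &byte, 1, READ);    /* ...as the read() path */
            if (ret)
                return ret;
            return copy_to_user((void __user *)arg, &byte, 1) ? -EFAULT : 0;
        default:
            return -ENOTTY;
        }
    }

    static const struct file_operations card_fops = {
        .owner          = THIS_MODULE,
        .read           = card_read,
        .unlocked_ioctl = card_ioctl,
    };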

How to interrupt a user-space app from a kernel driver?

I am writing a device driver that receives interrupts from hardware. (32 IRQs using MSI)
From the driver, I'd like to signal/interrupt the application that opened the device file that an event occurred.
I might be able to use signals, but I think they're not really reliable and are too slow. Moreover, only two SIGUSR signals are available.
I'd like to avoid adding overhead.
I'd like to avoid them because:
signal: not reliable enough and high latency
netlink: high latency, asynchronous, and may lose packets
polling/read/ioctl: needs a pthread and an infinite loop
Currently, I exchange data using ioctl/read/write syscalls.
What is the best practice to interrupt/signal an event to a user-space application from a kernel driver?
The method should support many interrupts/signals without losing any of them; it has to be reliable and fast.
Basically, I'd like to use my user-space app as bottom half of the interrupts I receive in the driver.
The device file is opened by a unique app.

How do I increase the speed of my USB cdc device?

I am upgrading the processor in an embedded system for work. This is all in C, with no OS. Part of that upgrade includes migrating the processor-PC communications interface from IEEE-488 to USB. I finally got the USB firmware written, and have been testing it. It was going great until I tried to push through lots of data only to discover my USB connection is slower than the old IEEE-488 connection. I have the USB device enumerating as a CDC device with a baud rate of 115200 bps, but it is clear that I am not even reaching that throughput, and I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong. I control every aspect of this from the front end on the PC to the firmware on the embedded system.
I am assuming my issue is how I write to the USB on the embedded system side. Right now my USB_Write function is run in free time, and is just a while loop that writes one char to the USB port until the write buffer is empty. Is there a more efficient way to do this?
One concern I have is that in the old system we had a board dedicated to communications. The CPU would just write data across a bus to this board, and it would handle the communications, which means that the CPU didn't have to spend free time handling the actual communications but could offload them to a "co-processor" (not a CPU, but functionally the same here). Even with this concern, though, I figured I should be getting faster speeds, given that full-speed USB is on the order of MB/s while IEEE-488 is on the order of kB/s.
In short is this more likely a fundamental system constraint or a software optimization issue?
I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong.
You are correct, the baud number is a dummy value. If you were creating a CDC/RS232 adapter you would use it to configure your RS232 hardware; in this case it means nothing.
Is there a more efficient way to do this?
Absolutely! You should be writing chunks of data the same size as your USB endpoint for maximum transfer speed. Depending on the device you are using, your stream of single-byte writes may be gathered into a single packet before sending, but in my experience (and judging by your results) this is unlikely.
Depending on your latency requirements, you can stick a circular buffer in between and only issue data from it to the USB_Write function once you have ENDPOINT_SZ bytes. If this results in excessive latency, or your interface is not always communicating, you may want to implement Nagle's algorithm.
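As a sketch of that chunking idea, here is one way to gather single bytes into endpoint-sized packets with a ring buffer; usb_write_packet() and ENDPOINT_SZ are placeholders for whatever your vendor USB stack provides:

    #include <stdint.h>
    #include <stdbool.h>

    #define ENDPOINT_SZ 64u          /* full-speed bulk endpoint size */
    #define RING_SZ     1024u        /* must be a power of two */

    static uint8_t ring[RING_SZ];
    static volatile uint32_t head, tail;

    /* Provided by your USB stack: queues one packet on the IN endpoint. */
    extern bool usb_write_packet(const uint8_t *data, uint32_t len);

    void tx_enqueue(uint8_t c)          /* called by the application */
    {
        ring[head & (RING_SZ - 1)] = c;
        head++;
    }

    void tx_service(void)               /* called from the main loop / free time */
    {
        uint32_t pending = head - tail;

        /* Send only full packets; a timeout-based flush (Nagle-style) can
           push out a short packet when the line goes idle. */
        while (pending >= ENDPOINT_SZ) {
            uint8_t pkt[ENDPOINT_SZ];
            for (uint32_t i = 0; i < ENDPOINT_SZ; i++)
                pkt[i] = ring[(tail + i) & (RING_SZ - 1)];

            if (!usb_write_packet(pkt, ENDPOINT_SZ))
                break;                  /* endpoint busy, retry on next pass */
            tail += ENDPOINT_SZ;
            pending -= ENDPOINT_SZ;
        }
    }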
One concern I have is that in the old system we had a board dedicated to communications.
The NXP part you mentioned in the comments is without a doubt fast enough to saturate a USB full speed connection.
In short is this more likely a fundamental system constraint or a software optimization issue?
I would consider this a software design issue rather than an optimisation one, but no, it is unlikely you are fundamentally stuck.
Do take care to figure out exactly what sort of USB connection you are using, though: with USB 1.1 you will be limited to 64 KB/s, and with USB 2.0 full speed you will be limited to 512 KB/s. If you require higher throughput you should migrate to using a separate bulk endpoint for the data transfer.
I would recommend reading through the USB made simple site to get a good overview of the various USB speeds and their capabilities.
One final issue, vendor CDC libraries are not always the best and implementations of the CDC standard can vary. You can theoretically get more data through a CDC endpoint by using larger endpoints, I have seen this bring host side drivers to their knees though - if you go this route create a custom driver using bulk endpoints.
Try testing your device on multiple systems; you may find you get quite different results between Windows and Linux. This will help point the finger at the host end.
And finally, make sure you are doing big buffered reads on the host side, as USB will stop transferring data once the host-side buffers are full.

Notify the kernel from userspace as fast as possible and vice versa - linux, c

Context :
Debian, 64-bit.
Making a Linux-only userspace networking stack that I may release as open source.
Everything is ready but one last thing.
The problem :
I know about poll/select/epoll and already use them heavily, but they are too complicated for my needs and tend to add latency (a few nanoseconds is already too much).
The need :
A simple means of notifying an application from the kernel that packets are to be processed, and the reverse, with a shared mmap'ed file operating as a multi-ring buffer. It should obviously not incur a context switch.
I wrote a custom driver for my NIC (and plan to create others for the big league -> 1-10Gb).
I would like two shared arrays of int and two shared arrays of char. I already have the multiprocess and non-blocking design working.
One pair (int and char) for the kernel -> app direction; another for the app -> kernel direction.
But how do I get notified at the very moment the mmap'ed memory has changed? I read that msync would do it, but it is slow too. That is my problem.
Mutexes lead to dead-slow code. Spinlocks tend to waste CPU cycles under load.
Not to mention a busy while(1) loop constantly reading -> a waste of CPU cycles.
What do you recommend?
It is my last step.
Thanks
Update:
I think I will have to pay the latency of setting the interrupt mask anyway, so it should ideally be amortized over the number of packets that arrive during that latency. The first few packets after a burst will always be slower, I guess, since I obviously don't spin in an infinite loop.
The worst case is when packets arrive only sparsely (hence why I'm chasing saturated-link performance in the first place). That worst case will be hit at times. But who cares, it is still faster than the stock kernel anyway. Trade-offs, trade-offs :)
It seems like you are taking approaches that are common for networking in RTOS-based embedded systems.
In Linux you are not supposed to write your own network stack - the Linux kernel already has a good one. You are just expected to implement a NIC device driver (in the kernel) which hands all the packets over to the Linux network stack for processing.
Any Linux network-related components are always in the kernel - and the problems you describe go some way towards explaining why that is essential for reasonable performance.
The only exception is userspace network filters (e.g., for firewalls) that may be hooked into the iptables mechanism - and those incur higher latencies on packets that are routed through them.
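For what it's worth, that hand-off is only a few lines in a driver's receive path. A rough sketch of the conventional pattern; the function name is made up and the surrounding DMA/NAPI plumbing is omitted:

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/skbuff.h>

    /* Called from the NIC's RX interrupt/NAPI path for each received frame. */
    static void mynic_rx_frame(struct net_device *dev, const u8 *data, int len)
    {
        struct sk_buff *skb = netdev_alloc_skb_ip_align(dev, len);
        if (!skb) {
            dev->stats.rx_dropped++;
            return;
        }

        skb_put_data(skb, data, len);          /* copy out of the DMA buffer */
        skb->protocol = eth_type_trans(skb, dev);

        netif_rx(skb);                         /* hand the frame to the kernel stack */

        dev->stats.rx_packets++;
        dev->stats.rx_bytes += len;
    }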

Reading a 4 µs long +5V TTL from a parallel port -- when to use kernel interrupts

I've got an experimental box of tricks running that, every 100 ms or so, will spit out a 4 microsecond long +5V pulse of electricity on a TTL line. The exact time that this happens is not known ahead of time, but it's important -- so I'd like to use the Red Hat 5.3 computer that essentially runs the experiment to service this TTL, and create a glorified timestamp.
At the moment, what I've done is wire the TTL into pin 13 of the parallel port (STATUS_SELECT, one of the input lines on a parallel port) on the Linux box, spawn a process when the experiment starts, use chrt to change its scheduling priority to 99 -- i.e. high -- and then just poll the parallel port repeatedly in a while loop until the pin goes high. I then create an accurate timestamp and, in a non-blocking way, write it to disk.
Obviously, this is inefficient -- sometimes the process is suspended, and a TTL will be missed. As the computer is, itself, busy doing other things (namely acquiring data from my experimental bit of kit -- an MRI scanner!) this happens quite often. Polling is easy, but probably bad.
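(A sketch of the kind of polling loop described, assuming direct port I/O at the usual 0x378 base address, where pin 13 maps to bit 4 of the status register at base+1; this is illustrative, not the actual experiment code:)

    #include <stdio.h>
    #include <time.h>
    #include <sys/io.h>       /* ioperm(), inb(); x86 Linux, needs root */

    #define LPT_BASE   0x378  /* typical first parallel port */
    #define LPT_STATUS (LPT_BASE + 1)
    #define PIN13_MASK 0x10   /* Select In -> bit 4 of the status register */

    int main(void)
    {
        if (ioperm(LPT_BASE, 3, 1) < 0) {
            perror("ioperm");
            return 1;
        }

        for (;;) {
            /* Busy-wait until pin 13 goes high. */
            while (!(inb(LPT_STATUS) & PIN13_MASK))
                ;

            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);

            /* Wait for the pulse to end before re-arming; a 4 us pulse is
               easily missed if the process gets preempted here. */
            while (inb(LPT_STATUS) & PIN13_MASK)
                ;
        }
    }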
My question is this: doing something quickly when a TTL occurs seems like the bread-and-butter of computing, but, as far as I can tell, it's only possible to deal with interrupts on linux if you're a kernel module. The parallel port can generate interrupts, and libraries like paraport let you build kernel modules relatively quickly, where you have to supply your own handler.
Is the best way to deal with this problem and create accurate (±25 ms) timestamps for an experiment whenever that TTL comes in -- to write a kernel module that provides a list of recent interrupts to somewhere in /proc, and then read them out with a normal process later? Is that approach not going to work, and be very CPU inefficient -- or open a bag of worms to do with interrupt priority I'm not aware of?
Most importantly, this seems like it should be a solved problem -- is it, and if so do any wise people wish to point me in the right direction? Writing a kernel module seems like, frankly, a lot of hard, risky work for something that feels as if it should perhaps be simple.
The premise that "it's only possible to deal with interrupts on linux if you're a kernel module" dismisses some fairly common and effective strategies.
The simple course of action for responding to interrupts in userspace (especially infrequent ones) is to have a driver which creates a kernel device (or in some cases a sysfs node) where either a read() or perhaps a custom ioctl() from userspace will block until the interrupt occurs. You'd have to check if the default parallel port driver supports this, but it's extremely common with the GPIO drivers on embedded-type boards, and the basic scheme could be borrowed for the parallel port - provided the hardware supports true interrupts.
If very precise timing is the goal, you might do better to customize the kernel module to record the timestamp there, and implement a mechanism where a read() from userspace blocks until the interrupt occurs, and then obtains the kernel's already recorded timestamp as the read data - thus avoiding the variable latency of waking userspace and calling back into the kernel to get the time.
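A condensed sketch of that pattern, with hypothetical names and the character-device registration and request_irq() call omitted: the IRQ handler records the timestamp, and read() blocks until one is available and then returns it to userspace.

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/interrupt.h>
    #include <linux/ktime.h>
    #include <linux/wait.h>
    #include <linux/atomic.h>
    #include <linux/uaccess.h>

    static DECLARE_WAIT_QUEUE_HEAD(ttl_wait);
    static ktime_t ttl_stamp;
    static atomic_t ttl_pending = ATOMIC_INIT(0);

    /* Interrupt handler: runs when the parallel port raises its IRQ. */
    static irqreturn_t ttl_irq(int irq, void *dev_id)
    {
        ttl_stamp = ktime_get_real();     /* timestamp taken in the kernel */
        atomic_set(&ttl_pending, 1);
        wake_up_interruptible(&ttl_wait);
        return IRQ_HANDLED;
    }

    /* read() blocks until an interrupt has occurred, then returns the stamp. */
    static ssize_t ttl_read(struct file *f, char __user *buf,
                            size_t len, loff_t *off)
    {
        s64 ns;

        if (wait_event_interruptible(ttl_wait, atomic_read(&ttl_pending)))
            return -ERESTARTSYS;
        atomic_set(&ttl_pending, 0);

        ns = ktime_to_ns(ttl_stamp);
        if (len < sizeof(ns))
            return -EINVAL;
        if (copy_to_user(buf, &ns, sizeof(ns)))
            return -EFAULT;
        return sizeof(ns);
    }

    static const struct file_operations ttl_fops = {
        .owner = THIS_MODULE,
        .read  = ttl_read,
    };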
You might also look at true local-bus serial ports (if present) as an alternative interrupt-capable interface in cases where the available parallel port is some partial or indirect implementation which doesn't support them.
In situations where your only available interface is something indirect and high latency such as USB, or where you want a lot of host- and operation-system- independence, then it may indeed make sense to use an external microcontroller. In that case, you would probably try to set the micro's clock from the host system, and then have it give you timestamp messages every time it sees an event. If your experiment only needs the timestamps to be relative to each other within a given experimental session, this should work well. But if you need to establish an absolute time synchronization across the USB latency, you may have to do some careful roundtrip measurement and then estimation of the latency in order to compensate it (see NTP for an extreme example).
