I have an application with two threads. Thread1 receives multicast packets from network card eth1; suppose I use sched_setaffinity to pin thread1 to CPU core 1. Thread2 then uses these packets (received by thread1 and stored in heap-allocated global variables) to do some processing, and I pin thread2 to core 7. Suppose core 1 and core 7 are the two hyper-threads of the same physical core; I would expect the performance to be good, since core 1 and core 7 share the L1 cache.
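For reference, a minimal sketch of pinning a thread to one core with pthread_setaffinity_np (the core number is just a placeholder; sched_setaffinity(0, ...) behaves similarly for the calling thread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single CPU core (sketch; the core id is a placeholder). */
static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pthread_setaffinity_np() restricts only the calling thread, not the whole process. */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "setaffinity failed: %d\n", rc);
    return rc;
}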
I have watched /proc/interrupts, and I see that eth1's interrupts land on several CPU cores. So in my case, although I pin thread1 to core 1, the interrupts happen on many cores; would that affect performance? Do the packets received from eth1 go directly to main memory no matter which core handles the interrupt? I don't know much about networking in the Linux kernel, so could anyone suggest books or websites on this topic? Thanks for any comments ~~
Edit: according to "What Every Programmer Should Know About Memory", section 6.3.5 "Direct Cache Access", I think DCA is what I want to know about...
The interrupt will (quite likely) happen on a different core than the one receiving the packet. Depending on how the driver deals with packets, that may or may not matter. If the driver reads the packet (e.g. to make a copy), then it's not ideal, as the cache gets filled on a different CPU. But if the packet is just loaded into memory somewhere using DMA and left there for the software to pick up later, then it doesn't matter [in fact, it's better to have it happen on a different CPU, as "your" CPU gets more time to do other things].
As to using hyperthreading, my experience (and that of many others) is that hyperthreading SOMETIMES gives a benefit, but often ends up being similar to not having hyperthreading, because the two threads use the same execution units of the same core. You may want to compare the throughput with both threads set to affinity on the same core as well, to see whether that makes it "better" or "worse". Like most things, it's often the details that make a difference, so your code may be slightly different from someone else's, meaning it works better in one case or the other.
Edit: if your system has multiple sockets, you may also want to ensure that the CPU you pin to is on the socket "nearest" (in terms of QPI/PCI bridge hops) to the network card.
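As an illustration (not part of the original answer), one way to see which NUMA node a NIC is attached to is to read its sysfs attribute; the interface name here is just an example:

#include <stdio.h>

/* Print the NUMA node that eth1's PCI device reports (-1 means "no NUMA information").
 * Pinning the receiving thread to a CPU on that node avoids extra QPI hops. */
int main(void)
{
    FILE *f = fopen("/sys/class/net/eth1/device/numa_node", "r");
    int node;
    if (f && fscanf(f, "%d", &node) == 1)
        printf("eth1 is attached to NUMA node %d\n", node);
    if (f)
        fclose(f);
    return 0;
}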
I am planning to measure PMU counters for L1, L2 and L3 cache misses and branch prediction misses. I have read the related Intel documents, but I am unsure about the scenarios below. Could someone please clarify?
// assume PMU reset and PERFEVTSELx configuration done above
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start);  // start the PMU counters
my_program();
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop);   // stop the PMU counters
// now read the PMU counters
1. What will happen if my process is scheduled out while my_program() is running and scheduled back onto another core?
2. What will happen if the process is scheduled out and scheduled back onto the same core, but meanwhile some other process has reset the PMU counters?
How can I make sure that I am reading correct values from the PMU counters?
Machine details: CentOS with Linux kernel 3.10.0-327.22.2.el7.x86_64, running on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz.
Thanks
Summary of the Intel forum thread started by the OP:
The Linux perf subsystem virtualizes the performance counters, but this means you have to read them with a system call, instead of rdpmc, to get the full virtualized 64-bit value instead of whatever is currently in the architectural performance counter register.
If you want to use rdpmc inside your own code so it can measure itself, pin each thread to a core because context switches don't save/restore PMCs. There's no easy way to avoid measuring everything that happens on the core, including interrupt handlers and other processes that get a timeslice. This can be a good thing, since you need to take the impact of kernel overhead into account.
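As a rough sketch of what inline rdpmc use looks like (the counter index is a placeholder and must match however PERFEVTSELx was programmed, and user-space RDPMC must be permitted via CR4.PCE):

#include <stdint.h>

/* Read programmable performance counter `idx` with RDPMC.
 * The thread must stay pinned to one core for the value to be meaningful. */
static inline uint64_t read_pmc(uint32_t idx)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
    return ((uint64_t)hi << 32) | lo;
}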
More useful quotes from John D. McCalpin, PhD ("Dr. Bandwidth"):
For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal. Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html
You can use "pread()" on the /dev/cpu/*/msr device files to read the
MSRs -- this may be a bit easier to read than IOCTL-based code. The
codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent
examples.
There have been a number of approaches to reserving and sharing performance counters, including both software-only and combined hardware+software approaches, but at this point there is not a "standard" approach. (It looks like Intel has a hardware-based approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what platforms support it.)
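A minimal sketch of the pread() approach mentioned above (loosely modelled on rdmsr.c from msr-tools; the MSR address and CPU number are placeholders, and the msr kernel module must be loaded):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR from CPU 0 via the /dev/cpu/N/msr interface.
 * The file offset passed to pread() is the MSR address. */
int main(void)
{
    const uint32_t msr = 0x309;          /* placeholder: IA32_FIXED_CTR0 */
    uint64_t value;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0 || pread(fd, &value, sizeof(value), msr) != sizeof(value)) {
        perror("msr read");
        return 1;
    }
    printf("MSR 0x%x = 0x%llx\n", msr, (unsigned long long)value);
    close(fd);
    return 0;
}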
Your questions:
what will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
You'll see random garbage; the same applies if another process resets the PMCs between timeslices of your process.
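If instead you go through the perf subsystem, as noted above, the kernel virtualizes the counter per thread, so migrations and other processes stop being a problem. A rough sketch using perf_event_open (counting L1D load misses is just an example event; my_program() stands in for the code under test):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

extern void my_program(void);            /* the code being measured */

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;      /* example event: L1D load misses */
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* The counter follows this thread across cores; the kernel saves/restores it. */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    my_program();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("L1D load misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}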
I got the answers from an Intel forum; the link is below.
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/673602
I have a small program running on Linux (on an embedded PC, dual-core Intel Atom 1.6GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run
clock_gettime() using CLOCK_MONOTONIC
select() with a timeout of 8 ms
clock_gettime() as before
Output the time difference of the two clock_gettime() calls
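In condensed form, one iteration of that loop looks roughly like this (a minimal sketch; error handling is omitted and the ttyUSB file descriptor is assumed to be open already):

#include <stdio.h>
#include <sys/select.h>
#include <time.h>

/* Time an 8 ms select() on the serial fd and report how long the call actually took. */
void wait_and_measure(int fd)
{
    struct timespec t0, t1;
    fd_set rfds;
    struct timeval tv = { .tv_sec = 0, .tv_usec = 8000 };   /* 8 ms timeout */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    select(fd + 1, &rfds, NULL, NULL, &tv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("select() took %.3f ms\n", elapsed_ms);
}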
To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
The kernel I am using has no RT_PREEMPT patches (I might switch to that patch in the future). I know that if I want "proper" realtime, I need to switch to RT_PREEMPT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:
Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).
So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to RT_PREEMPT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?
Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?
Update: I have observed a new, "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even when I am not using a kernel with any sort of realtime extension, I am somewhat worried by the fact that a deadline can apparently be missed by over a tenth of a second.
Without more information it is difficult to point to something specific, so I am just guessing here:
Interrupts and code that is triggered by interrupts take so much time in the kernel that your real time thread is significantly delayed. This depends on the frequency of interrupts, which interrupt handlers are involved, etc.
A thread with lower priority that is executing inside the kernel will not be preempted until it yields the CPU or leaves the kernel.
As pointed out in this SO answer, CPU System Management Interrupts and Thermal Management can also cause significant time delays (up to 300ms were observed by the poster).
118 ms seems quite a lot for a 1.6 GHz CPU. But one driver that accidentally locks the CPU for some time would be enough. If you can, try to disable some drivers or use different driver/hardware combinations.
sched_rt_runtime_us and sched_rt_period_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the runtime limit for RT threads and see what happens.
What else can you do? Write a device driver! It's not that difficult and interrupt handlers get a higher priority than realtime threads. It may be easier to switch to a real time kernel but YMMV.
I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR; communication with the device is handled by an FX3 microcontroller over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration, however, are handled by a C program that runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
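A stripped-down sketch of that kind of condition-variable hand-off (the names are illustrative, not the actual bladeRF code; the real version also tracks sample counts):

#include <pthread.h>
#include <stdbool.h>

/* Shared state used by the TX thread to tell the RX thread when to start. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tx_done_cv = PTHREAD_COND_INITIALIZER;
static bool tx_done = false;

void signal_tx_done(void)                 /* called by the TX thread */
{
    pthread_mutex_lock(&lock);
    tx_done = true;
    pthread_cond_signal(&tx_done_cv);
    pthread_mutex_unlock(&lock);
}

void wait_for_tx(void)                    /* called by the RX thread */
{
    pthread_mutex_lock(&lock);
    while (!tx_done)                      /* guard against spurious wakeups */
        pthread_cond_wait(&tx_done_cv, &lock);
    tx_done = false;                      /* re-arm for the next burst */
    pthread_mutex_unlock(&lock);
}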
While doing SMP porting of some of our drivers (on a powerpc target) we observed some behavior on which I need you guys to shed some light:

On doing a local_irq_disable() on a UP system the jiffies tend to freeze, i.e. the count stops incrementing. Is this expected? I thought that the decrementer interrupt is 'internal' and should not get affected by a local_irq_disable() kind of call, since I expected it to disable local IRQ processing (external interrupts). The system of course freezes, and then upon doing a local_irq_enable() the jiffies count jumps; it seems to be compensating for the 'time lapse' between the local_irq_disable() and enable() calls.
Doing the same on an SMP system (P2020 with 2 e500 cores) the results are surprising. Firstly, the module that is being inserted to do this testing always executes on core 1. Further, it sometimes does not see a freeze of the 'jiffies' counter, and sometimes we see that it indeed freezes. Again, in case of a freeze of the count, it tends to jump after doing a local_irq_enable(). I have no idea why this may be happening.

Do we know whether, on an SMP system, both cores run a scheduler tick timer, so that in some cases we do not see a freeze of the jiffies count, or is it just core 0?
Also, since the kernel timers rely on 'jiffies', would this mean that none of our kernel timers will fire if local_irq_disable() has been done? What would be the case if this is done on one of the cores in an SMP system?

There are many other questions, but I guess these will be enough to begin a general discussion about this :)

TIA
NS
Some more comments from the experimentation done.
My understanding at this point in time is that since kernel timers depend on 'jiffies' to fire, they won't actually fire on a UP system while I have issued a local_irq_save(). In fact, some of our code is based on the assumption that issuing a local_irq_save() guarantees protection against interrupts on the local processor and against kernel timers as well.
However, carrying out the same experiment on an SMP system, even with both cores executing a local_irq_save(), the jiffies do NOT stop incrementing and the system doesn't freeze. How is this possible? Is Linux using some other mechanism to trigger timer interrupts on an SMP system, possibly IPIs? This also breaks our assumption that local_irq_disable() will protect against kernel timers running on the same core, at least.
How do we go about writing code that is safe against async events, i.e. interrupts and kernel timers, and is valid for both UP and SMP?
local_irq_disable only disables interrupts on the current core, so when you're on a single core, everything is disabled (including timer interrupts), which is why jiffies are not updated.
When running on SMP, sometimes you happen to disable the interrupts on the core that's updating the jiffies, sometimes not.
This usually is not a problem, because interrupts are supposed to be disabled only for very short periods, and all scheduled timers will fire after interrupts get enabled again.
How do you know that your module always runs on core 1? On current versions of the kernel, it may even be running on more than one core at the same time (unless you explicitly forced it not to).
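As for the "safe against async events on both UP and SMP" question: the usual pattern is not to rely on local_irq_save() alone but to protect the shared data with a spinlock, taken with spin_lock_irqsave() in process context and with a plain spin_lock() in the interrupt or timer handler. A minimal sketch (the names are illustrative):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

/* Illustrative shared state, protected the same way on UP and SMP. */
static DEFINE_SPINLOCK(my_lock);
static unsigned long my_events;

/* Interrupt (or timer) context: the lock alone is enough here. */
static irqreturn_t my_isr(int irq, void *dev)
{
    spin_lock(&my_lock);
    my_events++;
    spin_unlock(&my_lock);
    return IRQ_HANDLED;
}

/* Process context: disable local IRQs *and* take the lock, so the handler can
 * neither interrupt us on this core nor race with us from another core. */
static unsigned long read_and_clear_events(void)
{
    unsigned long flags, n;
    spin_lock_irqsave(&my_lock, flags);
    n = my_events;
    my_events = 0;
    spin_unlock_irqrestore(&my_lock, flags);
    return n;
}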
There are several facets to this problem. Let's take them one by one.
1.
a)
local_irq_save() simply clears the IF flag of the eflags register. IRQ handlers can run concurrently on the other cores.
global_irq_save() is not available because it would require interprocessor communication to implement, and it is not really needed anyway, since local irq disabling is intended for very short periods of time only.
b)
Modern APICs allow dynamic IRQ distribution among the present cores and, besides rare exceptions, the kernel essentially programs the necessary registers to obtain a round-robin distribution of the IRQs.
The consequence is that, if irqs are disabled locally for long enough, then once the APIC delivers an IRQ to the core that has them disabled, the system globally stops receiving that particular IRQ, up to the point where irqs are finally re-enabled locally on the core that received the last IRQ of that type.
2.
Concerning the different results for jiffies updates with irqs disabled: it depends on the selected clocksource.
You can figure out which one is chosen by consulting:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
If you have tsc as the clocksource, then every core has it locally. However, if your clocksource is something else, e.g. HPET (an external device), then jiffies will become frozen for the reasons described in point #1.
I am using QueryPerformanceCounter to time some code. I was shocked when the code started reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
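In code, the conversion looks roughly like this (a minimal sketch, not the actual program; do_work() is a placeholder for the timed code):

#include <windows.h>
#include <stdio.h>

extern void do_work(void);               /* placeholder for the code being timed */

int main(void)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);    /* ticks per second */
    QueryPerformanceCounter(&start);
    do_work();
    QueryPerformanceCounter(&end);
    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}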
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed, as the wall-clock time of the running program did not change, although the time reported using QPC did change (it dropped by a factor of 2.7/4.1).
MyComputer -> Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz; 4.11 GHz; 1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and are not affected by changes in the CPU clock. The strange thing in my situation is that the frequency reported by QPF changed to an incorrect value, and this changed frequency was also reported in MyComputer -> Properties, which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports a technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected; it uses an RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) milliseconds' worth of occasional phase adjustments, and are willing to do some math to perform said phase adjustment by continuously or periodically re-normalizing your performance counter values against the reported performance counter frequency and the RTC's low-resolution time (you can do this on demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs).
You can try to use the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sits idle. Randomly choose from the set of available CPUs and you have at least a fighting chance of not being locked to an overloaded CPU.
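For example, something along these lines (a sketch using the Windows affinity APIs; error handling omitted) picks a random CPU from the process's allowed set instead of hard-coding CPU 0:

#include <windows.h>
#include <stdlib.h>
#include <time.h>

/* Pin the calling thread to one randomly chosen CPU from the process's
 * affinity mask, rather than always CPU 0. */
void pin_to_random_cpu(void)
{
    DWORD_PTR process_mask, system_mask;
    GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask);

    /* Collect the CPUs this process is allowed to run on. */
    int cpus[sizeof(DWORD_PTR) * 8], n = 0;
    for (int i = 0; i < (int)(sizeof(DWORD_PTR) * 8); i++)
        if (process_mask & ((DWORD_PTR)1 << i))
            cpus[n++] = i;

    srand((unsigned)time(NULL));
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpus[rand() % n]);
}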