Calculating CAN bus speed in C

I need to validate and characterize CAN bus traffic for our product (call it the Unit Under Test, UUT). I have a machine that sends a specified number of CAN frames to our product. Our product runs a custom Linux-based kernel. The CAN frames are pre-built in software on the sender machine using a specific algorithm, and the UUT uses the same algorithm to verify the received frames.
Also, and here is where my questions lie, I am trying to calculate some timing data in the UUT software. So I basically do a read loop as fast as possible. I have a pre-allocated buffer to store the frames, so I just call read and increment the pointer to the buffer:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, timespec_start_ptr);
while ((frames_left--) > 0)
    read(can_sock_fd, frame_mem_ptr++, sizeof(struct can_frame));
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, timespec_stop_ptr);
My question has to do with the times I get when I calculate the difference between these two timespecs (the calculation I use is correct; I have verified it, and it is GNU's algorithm).
Also, running the program under the time utility agrees with my times. For example, my program is called tcan, so I might run
[prompt]$ time ./tcan can1 -nf 10000
to run on the can1 interface with 10000 frames. (This is FlexCAN, a socket-based interface, BTW.)
Then I use the time difference to calculate the data transfer speed I obtained. I received num_frames frames in that time span, so I calculate frames/sec and bits/sec.
I am getting bus speeds that are 10 times the CAN bus speed of 250000 bits per sec. How can this be? I only get 2.5% CPU utilization according to both my program and the time program (and the top utility as well).
Are the values I am calculating meaningful? Is there something better I could do? I am assuming that since time reports real times that are much greater than user+sys, there must be some time accounting lost somewhere. Another possibility is that maybe it's correct; I don't know, it's puzzling.

This is kind of a long shot, but what if read() is returning early because otherwise it would have to wait for incoming data? The fastest data to read is none at all :)
It would mess up the timings, but have you tried running this loop with error checking? Or implementing the loop via recv(), which should block unless you have asked it not to?
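For illustration, a rough sketch of the loop with the return value checked (the function wrapper and the short-read handling are my own additions, not the question's code):

/* Count a frame only when read() actually delivered a complete one. */
#include <stdio.h>
#include <unistd.h>
#include <linux/can.h>

static int read_frames(int can_sock_fd, struct can_frame *buf, int count)
{
    int got = 0;
    while (got < count) {
        ssize_t n = read(can_sock_fd, &buf[got], sizeof(struct can_frame));
        if (n < 0) {                          /* real error (or a signal): stop */
            perror("read");
            return -1;
        }
        if (n != (ssize_t)sizeof(struct can_frame))
            continue;                         /* short read: don't count it */
        got++;
    }
    return got;
}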
Hopefully this helps.

Related

How to handle asynchronous input and synchronous output?

I am currently working on a project where I have USART input and SAI (Serial Audio Interface, similar to SPI) output on an STM32 system.
I created a circular buffer which acts as a ping-pong (double) buffer. The input samples received from the USART are stored in this buffer at the head pointer. When the SAI peripheral requests new data, it is pulled from the buffer's tail pointer.
At the start of my code I wait until half the buffer is filled, then activate the SAI. The SAI outputs at a constant rate of 40 kHz. Input samples arrive from the external device's USART at approximately the same rate, 40 kHz.
Ideally, I expect the difference between my head and tail pointer to be constant.
I also implemented a protection mechanism: when the two pointers end up at the same location, the tail pointer waits (the SAI keeps outputting the last sample) until half of the buffer has filled again.
The code works at the start. The problem is that after some time, roughly 2 minutes, the head and tail pointers end up at the same location, which creates a discontinuity in our samples. That means one pointer is moving slower or faster than expected. I am sure the SAI outputs at 40 kHz constantly (I checked it with a scope). However, I am not so sure about the accuracy of the USART's timing. I cannot modify the external USART device's code, and I cannot change the 40 kHz output rate; it must be this value.
Is there another way (maybe other than the ping-pong buffer method) to handle asynchronous input and synchronous output?
If what you are saying is that you are receiving (continuous) serial data from some external device and then forwarding it out some interface of your own at some rate based on your own clock, then a buffer overflow is expected somewhere, even if the data is the same format and the clocks are nominally "the same".
The same thing happens with Ethernet or any other source if 1) the input is continuous at line rate and 2) the input and the output are driven by different clocks, because 3) there are guaranteed to be differences between the reference clocks. If the input source clock is a little faster, then so long as the stream stays continuous and at line rate, you will overflow eventually.
The clocks change with temperature and voltage so the delta can change.
It is even possible to reduce the percentage of the input data you output and still overflow if the input is continuous. It depends on whether your output is also at line rate, or whether you have margin and that margin can overcome the difference in the clocks.
Also remember that UARTs hardly ever run at the exact nominal rate; they use clock dividers and only get close. You can have two computers using UARTs at the "same" rate, yet the delta can be relatively large and the overflow can happen very soon. For a UART to work, the clock only has to be good enough to get through one character, so it can be several percent off, if not more, even if the oscillator is very good, there are no PLLs, and both sides use the same reference clock (but not the same UART, clocking system, etc.).
If you increase your output rate or reduce the data being output so that it is not at line rate then the problem may go away or may take hours or days before it happens...
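To put rough numbers on it, here is a back-of-the-envelope sketch; the sample rate, clock mismatch and buffer size below are assumptions for illustration, not measurements of the asker's hardware:

/* Estimate how long until the head and tail pointers collide, given a
 * nominal rate, a clock mismatch and half a buffer of slack. */
#include <stdio.h>

int main(void)
{
    double rate_hz       = 40000.0;  /* nominal sample rate                    */
    double mismatch_ppm  = 200.0;    /* e.g. two 100 ppm crystals              */
    double slack_samples = 512.0;    /* half of an assumed 1024-sample buffer  */

    double drift_per_sec = rate_hz * mismatch_ppm / 1e6;  /* samples gained/lost per second */

    printf("drift: %.2f samples/s, pointers collide after ~%.0f s\n",
           drift_per_sec, slack_samples / drift_per_sec);
    return 0;
}

With these example numbers the drift is 8 samples per second and the pointers meet after roughly a minute, which is the same order of magnitude as the two minutes observed.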
If I have misunderstood the problem, forgive me; I will delete this answer.

How to limit read speed from a tcp socket [duplicate]

I'm writing a client-server app using BSD sockets. It needs to run in the background, continuously transferring data, but cannot hog the bandwidth of the network interface from normal use. Depending on the speed of the interface, I need to throttle this connection to a certain max transfer rate.
What is the best way to achieve this, programmatically?
The problem with sleeping a constant amount of 1 second after each transfer is that you will have choppy network performance.
Let BandwidthMaxThreshold be the desired bandwidth threshold.
Let TransferRate be the current transfer rate of the connection.
Then...
If you detect your TransferRate > BandwidthMaxThreshold then you do a SleepTime = 1 + SleepTime * 1.02 (increase sleep time by 2%)
Before or after each network operation do a
Sleep(SleepTime)
If you detect your TransferRate is a lot lower than your BandwidthMaxThreshold you can decrease your SleepTime. Alternatively you could just decay/decrease your SleepTime over time always. Eventually your SleepTime will reach 0 again.
Instead of an increase of 2%, you could also increase by a larger amount, linearly proportional to the difference TransferRate - BandwidthMaxThreshold.
This solution is good because you will have no sleeps if the user's network throughput is already lower than your threshold.
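As a rough illustration of that feedback loop (the names, the 2% step and the decay factor are placeholders, not a tested implementation):

/* Adaptive-sleep throttle: grow the sleep while over the threshold,
 * decay it back toward zero otherwise. Call around each send()/recv(). */
#include <unistd.h>

static double sleep_time_us = 0.0;

void throttle(double transfer_rate, double bandwidth_max_threshold)
{
    if (transfer_rate > bandwidth_max_threshold)
        sleep_time_us = 1.0 + sleep_time_us * 1.02;  /* back off by ~2% */
    else
        sleep_time_us *= 0.98;                       /* decay back toward 0 */

    if (sleep_time_us >= 1.0)
        usleep((useconds_t)sleep_time_us);
}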
The best way would be to use a token bucket.
Transmit only when you have enough tokens to fill a packet (1460 bytes would be a good amount), or if you are the receive side, read from the socket only when you have enough tokens; a bit of simple math will tell you how long you have to wait before you have enough tokens, so you can sleep that amount of time (be careful to calculate how many tokens you gained by how much you actually slept, since most operating systems can sleep your process for longer than you asked).
To control the size of the bursts, limit the maximum amount of tokens you can have; a good amount could be one second worth of tokens.
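A minimal sketch of such a token bucket, assuming tokens are counted in bytes and CLOCK_MONOTONIC is available; the rate and burst parameters are whatever cap you choose:

/* Token bucket: refill by elapsed time, spend tokens per byte sent/read. */
#include <time.h>

struct token_bucket {
    double tokens;          /* current tokens (bytes)         */
    double rate;            /* refill rate (bytes per second) */
    double burst;           /* cap, e.g. one second's worth   */
    struct timespec last;   /* time of the last refill        */
};

static double elapsed_s(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Returns how many seconds to sleep before 'bytes' may go out (0 = now). */
double tb_wait(struct token_bucket *tb, double bytes)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    tb->tokens += tb->rate * elapsed_s(tb->last, now);
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    tb->last = now;

    if (tb->tokens >= bytes) {
        tb->tokens -= bytes;
        return 0.0;
    }
    return (bytes - tb->tokens) / tb->rate;
}

After sleeping, recompute with the time you actually slept, since the OS may well have slept you longer than requested.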
I've had good luck with trickle. It's cool because it can throttle arbitrary user-space applications without modification. It works by preloading its own send/recv wrapper functions which do the bandwidth calculation for you.
The biggest drawback I found was that it's hard to coordinate multiple applications that you want to share finite bandwidth. "trickled" helps, but I found it complicated.
Update in 2017: it looks like trickle moved to https://github.com/mariusae/trickle

Algorithm - Handling Jitter and Drift with External Codec/Modem

I am writing a small module in C to handle jitter and drift for a full-duplex audio system. It acts as a very primitive voice chat module, which connects to an external modem that uses a separate clock, independent from my master system clock (ie: it is not slaved off of the system master clock).
The source is based off of an existing example available online here: http://svn.xiph.org/trunk/speex/libspeex/jitter.c
I have 4 audio streams:
Network uplink (my voice, after processing, going to the far side speaker)
Network downlink (far side's voice, before processing, coming to me)
Speaker output (the far side's voice, after processing, to the local speakers)
Mic input (my voice, before processing, coming from the local microphone)
I have two separate threads of execution. One handles the local devices and buffer (ie: playing processed audio to the speakers, and capturing data from the microphone and passing it off to the DSP processing library to remove background noise, echo, etc). The other thread handles pulling the network downlink signal and passing it off to the processing library, and taking the processed data from the library and pushing it via the uplink connection.
The two threads use mutexes and a set of shared circular/ring buffers. I am looking for a way to implement a sure-fire (safe and reliable) jitter and drift correction mechanism. By jitter, I am referring to a clock having variable duty cycle, but the same frequency as an ideal clock.
The other potential issue I would need to correct is drift, which would assume both clocks use an ideal 50% duty cycle, but their base frequency is off by ±5%, for example.
Finally, these two issues can occur simultaneously. What would be the ideal approach to this? My current approach is to use a type of jitter buffer. These are just data buffers which keep a moving average of their "fill" level. If a thread tries to read from the buffer and not enough data is available (a buffer underflow), I just generate data for it on-the-fly, either by providing a spare zeroed-out packet or by duplicating a packet (ie: packet loss concealment). If data is coming in too quickly, I discard an entire packet of data and keep going. This handles the jitter portion.
The second half of the problem is drift correction. This is where the average fill level metric comes in useful. For all buffers, I can calculate the relative growth/reduction levels in various buffers, and add or subtract a small number of samples every so often so that all buffer levels hover around a common average "fill" level.
Does this approach make sense, and are there any better or "industry standard" approaches to handling this problem?
Thank you.
References
Word Clock – What’s the difference between jitter and frequency drift?, Accessed 2014-09-13, <http://www.apogeedigital.com/knowledgebase/fundamentals-of-digital-audio/word-clock-whats-the-difference-between-jitter-and-frequency-stability/>
Jitter.c, Accessed 2014-09-13, <http://svn.xiph.org/trunk/speex/libspeex/jitter.c>
I faced a similar, although admittedly simpler, problem. I won't be able to fully answer your question, but I hope sharing my solutions to some practical problems I ran into will benefit you anyway.
Last year I was working on a system which had to simultaneously record from and render to multiple audio devices, each potentially ticking off a different clock. The most obvious example is a duplex stream on 2 devices, but it also handled setups with multiple inputs or outputs only. All in all it was a bit simpler than your situation (single threaded and no network I/O). In the end I don't believe dealing with more than 2 devices is harder than dealing with 2; any system with multiple clocks is going to have to deal with the same problems.
Some stuff I've learned:
Pick one stream and designate its clock as "the truth" (i.e., sync all other streams to a common master clock). If you don't do this you won't have a well-defined notion of "current sample position", and without it there's nothing to sync to. This also has the benefit that at least one stream in the system will always be clean (no dropping/padding of samples).
Your approach of using an additional buffer to handle jitter is correct. Without it you'd be constantly dropping/padding even on streams with the same nominal sample rate.
Consider whether or not you'd want to introduce such a jitter buffer for the "master" stream also. Doing so means introducing artificial latency in the master stream; not doing so means the rest of your streams will lag behind.
I'm not sure whether it's a good idea to drop entire packets. Why not try to use up as many of the samples as possible? Especially with large packet sizes this is far less noticeable.
To elaborate on the above, I got badly bitten by the following case: assume s1 (master) producing 48000 frames every second and s2 producing 96000 every 2 seconds. Round 1: read 48000 from s1, 0 from s2. Round 2: read 48000 from s1, 96000 from s2 -> overflow. Discard entire packet. Round 3: read 48000 from s1, 0 from s2. Etc. Obviously this is a contrived example, but I ran into cases where on average I dropped 50% of the secondary stream's data using this scheme. Introducing the jitter buffer does help but didn't completely fix this problem. Note that this is not strictly related to clock jitter/skew; it's just that some drivers like to update their padding values periodically, and they will not accurately report to you what is really in the hardware buffer.
Another variation of this problem happens when you really do have clock jitter but the API of your choice doesn't let you control the packet size (e.g., it won't let you request fewer frames than are actually available). Assume s1 (master) recording at 1000 Hz and s2 alternating each second between 1000 and 1001 Hz. Round 1: read 1000 frames from both. Round 2: read 1000 frames from s1 and 1001 from s2 -> overflow. Etc.; on average you'll dump around 50% of the frames on s2. Note that this is not so much a problem if your API lets you say "give me 1000 samples even though I know you've got more". By doing so, though, you'll eventually overflow the hardware input buffer.
To have the most control over when to drop/pad, I found it easiest to always keep input buffers empty and output buffers full. This way all dropping/padding takes place in the jitter buffer and you'll at least know and control what's happening.
If possible try to separate your program logic: the hard part is finding out where to pad/drop samples. Once you've got that in place it's easy to try different variations of pad/drop, sample-and-hold, interpolation etc.
All in all I'd say your solution looks very reasonable, although I'm not sure about the "drop entire packet thing" and I'd definitely pick one stream as the master to sync against. For completeness here's the solution I eventually came up with:
1: Assume a jitter buffer of size J on each stream.
2: Wait for a packet of size M to become available on the master stream (M is typically derived from the stream latency). We're going to deliver M frames of input/output to the app. I didn't implement an additional buffer on the master stream.
3: For all input streams: let H be the number of recorded frames in the hardware buffer, B the number of recorded frames currently in the jitter buffer, and A the number of frames available to the application: A = H + B.
3a: If A < M, we have input underflow. Offer A recorded frames + (M - A) padding frames to the app. Since the device is likely slow, fill 1/2 of the jitter buffer with silence.
3b: If A == M, offer A frames to the app. The jitter buffer is now empty.
3c: If A > M but (A - M) <= J, offer M recorded frames to the app. A - M frames stay in the jitter buffer.
3d: If A > M and (A - M) > J, we have input overflow. Offer M recorded frames to the app, of the remaining frames put J/2 back in the jitter buffer, we use up M + J/2 frames and we drop A - (M + J/2) frames as overflow. Don't try to keep the jitter buffer full because the device is likely fast and we don't want to overflow again on the next round.
4: Sort of the inverse of 3: for outputs, fast devices will underflow, slow devices will overflow.
A, H and B are the same as above, but this time they don't represent available frames but available padding (i.e., how many frames can I offer the app to write into).
Try to keep hardware buffers full at all costs.
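To make steps 3a-3d concrete, here is a sketch of the input-side decision written as a pure calculation; the four counters stand in for whatever delivery/padding/drop calls a real implementation would make:

/* Plan what to do with an input stream this round: deliver, pad, keep, drop. */
#include <stdio.h>

void plan_input(int H, int B, int M, int J)
{
    int A = H + B;              /* frames available to the application */
    int deliver = 0, pad = 0, keep = 0, drop = 0;

    if (A < M) {                /* 3a: input underflow                        */
        deliver = A;
        pad     = M - A;
        keep    = J / 2;        /* refill half the jitter buffer with silence */
    } else if (A == M) {        /* 3b: exact fit, jitter buffer empties       */
        deliver = M;
    } else if (A - M <= J) {    /* 3c: surplus fits in the jitter buffer      */
        deliver = M;
        keep    = A - M;
    } else {                    /* 3d: input overflow                         */
        deliver = M;
        keep    = J / 2;
        drop    = A - (M + J / 2);
    }

    printf("deliver=%d pad=%d keep=%d drop=%d\n", deliver, pad, keep, drop);
}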
This scheme worked out quite well for me, although there's a few things to consider:
It involves a lot of bookkeeping. Make sure that for input buffers, data always flows from hardware -> jitter buffer -> application, and for outputs always from app -> jitter buffer -> hardware. It's very easy to make the mistake of thinking you can "skip" frames in the jitter buffer if there are enough samples available from the hardware directly to the app. This will essentially mess up the chronological order of frames in an audio stream.
This scheme introduces variable latency on secondary streams because I try to postpone the moment of padding/dropping as long as possible. This may or may not be a problem. I found that in practice postponing these operations gives audibly better results, probably because many "minor" glitches of only a few samples are more annoying than the occasional larger hiccup.
Also, PortAudio (an open source audio project) has implemented a similar scheme; see http://www.portaudio.com/docs/proposals/001-UnderflowOverflowHandling.html. It may be worthwhile to browse through the mailing list and see what problems/solutions came up there.
Note that everything I've said so far is only about interaction with the audio hardware. I've no idea whether this will work equally well with the network streams, but I don't see any obvious reason why not. Just pick one audio stream as the master and sync the other one to it, and do the same for the network streams. This way you'll end up with two more-or-less independent systems connected only by the ring buffer, each with an internally consistent clock, each running on its own thread. If you're aiming for low audio latency, you'll also want to drop the mutexes and opt for a lock-free FIFO of some sort.
I am curious to see if this is possible. I'll throw in my two bits though.
I am a novice programmer, but studied audio engineering/interactive audio.
My first assumption is that this is not possible. At least not on a sample-to-sample basis. Especially not for complex audio data and waveforms such as human speech. The program could have no expectation of what the waveform "should" look like.
This is why there are high-end audio interfaces with temperature controlled internal clocks.
On the other hand, maybe there is a library that can detect the symptoms of jitter, somehow...
In which case I would be very curious to hear about it.
As far as drift correction goes, maybe I don't understand something on the programming front, but shouldn't you be pulling audio at a specific sample rate? I believe sample rate/drift is handled at the hardware level.
I really hope this helps. You might have to steer me closer to home.

Why does the measured network latency change if I use a sleep?

I'm trying to determine the time that it takes for a machine to receive a packet, process it and give back an answer.
This machine, that I'll call 'server', runs a very simple program, which receives a packet (recv(2)) in a buffer, copies the received content (memcpy(3)) to another buffer and sends the packet back (send(2)). The server runs NetBSD 5.1.2.
My client measures the round-trip time a number of times (pkt_count):
struct timespec start, end;
for(i = 0; i < pkt_count; ++i)
{
printf("%d ", i+1);
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
//struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
//nanosleep(&nsleep, NULL);
printf("%.3f ", timespec_diff_usec(&end, &start));
}
I removed error checks and other minor things for clarity. The client runs on an Ubuntu 12.04 64-bit. Both programs run in real-time priority, although only the Ubuntu kernel is real time (-rt). The connection between the programs is TCP. This works fine and gives me an average of 750 microseconds.
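For reference, a minimal timespec_diff_usec consistent with how it is used above (returning the difference in microseconds as a double); the helper actually used may differ:

#include <time.h>

static double timespec_diff_usec(const struct timespec *end,
                                 const struct timespec *start)
{
    return (end->tv_sec - start->tv_sec) * 1e6 +
           (end->tv_nsec - start->tv_nsec) / 1e3;
}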
However, if I enable the commented out nanosleep call (with a sleep of 100 µs), my measures drop 100 µs, giving an average of 650 µs. If I sleep for 200 µs, the measures drop to 550 µs, and so on. This goes up until a sleep of 600 µs, giving an average of 150 µs. Then, if I raise the sleep to 700 µs, my measures go way up to 800 µs in average. I confirmed my program's measures with Wireshark.
I can't figure out what is happening. I already set the TCP_NODELAY socket option in both client and server, no difference. I used UDP, no difference (same behavior). So I guess this behavior is not due to the Nagle algorithm. What could it be?
[UPDATE]
Here's a screenshot of the output of the client together with Wireshark. Now, I ran my server in another machine. I used the same OS with the same configuration (as it is a Live System in a pen drive), but the hardware is different. This behaviour didn't show up, everything worked as expected. But the question remains: why does it happen in the previous hardware?
[UPDATE 2: More info]
As I said before, I tested my pair of programs (client/server) in two different server computers. I plotted the two results obtained.
The first server (the weird one) is a RTD Single Board Computer, with a 1Gbps Ethernet interface. The second server (the normal one) is a Diamond Single Board Computer with a 100Mbps Ethernet interface. Both of them run the SAME OS (NetBSD 5.1.2) from the SAME pendrive.
From these results, I believe that this behaviour is due either to the driver or to the NIC itself, although I still can't imagine why it happens...
OK, I reached a conclusion.
I tried my program using Linux, instead of NetBSD, in the server. It ran as expected, i.e., no matter how much I [nano]sleep in that point of the code, the result is the same.
This fact tells me that the problem might lie in NetBSD's interface driver. To identify the driver, I read the dmesg output. This is the relevant part:
wm0 at pci0 dev 25 function 0: 82801I mobile (AMT) LAN Controller, rev. 3
wm0: interrupting at ioapic0 pin 20
wm0: PCI-Express bus
wm0: FLASH
wm0: Ethernet address [OMMITED]
ukphy0 at wm0 phy 2: Generic IEEE 802.3u media interface
ukphy0: OUI 0x000ac2, model 0x000b, rev. 1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
So, as you can see, my interface is called wm0. According to this (page 9) I should check which driver is loaded by consulting the file sys/dev/pci/files.pci, line 625 (here). It shows:
# Intel i8254x Gigabit Ethernet
device wm: ether, ifnet, arp, mii, mii_bitbang
attach wm at pci
file dev/pci/if_wm.c wm
Then, searching through the driver source code (dev/pci/if_wm.c, here), I found a snippet of code that might change the driver behavior:
/*
* For N interrupts/sec, set this value to:
* 1000000000 / (N * 256). Note that we set the
* absolute and packet timer values to this value
* divided by 4 to get "simple timer" behavior.
*/
sc->sc_itr = 1500; /* 2604 ints/sec */
CSR_WRITE(sc, WMREG_ITR, sc->sc_itr);
Then I changed this 1500 value to 1 (trying to increase the number of interrupts per second allowed) and to 0 (trying to eliminate the interrupt throttling altogether), but both of these values produced the same result:
Without nanosleep: latency of ~400 us
With a nanosleep of 100 us: latency of ~230 us
With a nanosleep of 200 us: latency of ~120 us
With a nanosleep of 260 us: latency of ~70 us
With a nanosleep of 270 us: latency of ~60 us (minimum latency I could achieve)
With a nanosleep of anything above 300 us: ~420 us
This is, at least, better behaved than the previous situation.
Therefore, I concluded that the behavior is due to the interface driver of the server. I am not willing to investigate it further in order to find other culprits, as I am moving from NetBSD to Linux for the project involving this Single Board Computer.
This is a (hopefully educated) guess, but I think it might explain what you're seeing.
I'm not sure how real-time the Linux kernel is. It might not be fully pre-emptive... So, with that disclaimer, continuing :)...
Depending on the scheduler, a task will possibly have what is called a "quanta", which is just an amount of time it can run for before another task of the same priority will be scheduled in. If the kernel is not fully pre-emptive, this might also be the point where a higher priority task can run. This depends on the details of the scheduler, which I don't know enough about.
Anywhere between your first gettime and second gettime your task may be pre-empted. This just means that it is "paused" and another task gets to use the CPU for a certain amount of time.
The loop without the sleep might go something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PRE-EMPTION .. your task's quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PRE-EMPTION again and you're back on the CPU
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
// Because you got pre-empted, your time measurement is artificially long
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
<----- PRE-EMPTION .. your task's quanta has run out and the scheduler kicks in
... another task runs for a little while
<----- PRE-EMPTION again and you're back on the CPU
and so on....
When you put the nanosecond sleep in, this is most likely a point where the scheduler is able to run before the current task's quanta expires (the same would apply to recv() too, which blocks). So perhaps what you get is something like this
clock_gettime(CLOCK_MONOTONIC, &start);
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);
struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
nanosleep(&nsleep, NULL);
<----- PRE-EMPTION .. nanosleep allows the scheduler to kick in because this is a pre-emption point
... another task runs for a little while
<----- PRE-EMPTION again and you're back on the CPU
// Now it so happens that because your task got pre-empted where it did, the time
// measurement has not been artificially increased. Your task can then finish the rest of
// its quanta
printf("%.3f ", timespec_diff_usec(&end, &start));
clock_gettime(CLOCK_MONOTONIC, &start);
... and so on
Some kind of interleaving will then occur where sometimes you are pre-empted between the two gettime()s and sometimes outside of them because of the nanosleep. Depending on the sleep duration, you might hit a sweet spot where you happen (by chance) to get your pre-emption point, on average, to be outside your time measurement block.
Anyway, that's my two-pennies worth, hope it helps explain things :)
A little note on "nanoseconds" to finish with...
I think one needs to be cautious with "nanoseconds" sleep. The reason I say this is that I think it is unlikely that an average computer can actually do this unless it uses special hardware.
Normally an OS will have a regular system "tick", generated at perhaps 5 ms intervals. This is an interrupt generated by, say, an RTC (Real Time Clock - just a bit of hardware). Using this "tick" the system then generates its internal time representation. Thus, the average OS will only have a time resolution of a few milliseconds. The reason that this tick is not faster is that there is a balance to be struck between keeping very accurate time and not swamping the system with timer interrupts.
Not sure if I'm a little out of date with the average modern PC... I think some of them do have higher-resolution timers, but still not into the nanosecond range, and they might even struggle at 100 µs.
So, in summary, keep in mind that the best time resolution you're likely to get is normally in the milliseconds range.
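If you'd rather query than guess, POSIX lets you ask for a clock's advertised resolution; something along these lines (CLOCK_MONOTONIC chosen to match the code above), keeping in mind that the achievable sleep granularity may still be coarser than what is reported:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
        printf("CLOCK_MONOTONIC resolution: %ld s %ld ns\n",
               (long)res.tv_sec, res.tv_nsec);
    return 0;
}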
EDIT: Just revisiting this and thought I'd add the following... it doesn't explain what you're seeing but might provide another avenue for investigation...
As mentioned the timing accuracy of the nanosleep is unlikely to be better than milliseconds. Also your task can be pre-empted which will also cause timing problems. There is also the problem that the time taken for a packet to go up the protocol stack can vary, as well as network delay.
One thing you could try, if your NIC supports it, is IEEE 1588 (aka PTP). If your NIC supports it, it can timestamp PTP event packets as they leave and enter the PHY. This will give you the best possible estimate of network delay and eliminates any problems you might have with software pre-emption etc. I know squat about Linux PTP, I'm afraid, but you could try http://linuxptp.sourceforge.net/
I think 'quanta' is the best theory to explain this.
On Linux it is the context switch frequency.
The kernel gives a process its quantum of time, but the process is preempted in several situations:
The process calls a system procedure (makes a system call)
The quantum time runs out
A hardware interrupt comes in (from the network, HDD, USB, clock, etc.)
Unused quantum time is assigned to another ready-to-run process, according to priorities, real-time class, etc.
Actually, the context switch frequency is configured at 10000 times per second, which gives about 100 µs per quantum, but context switching itself takes time and is CPU dependent; see this:
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
I don't understand why the context switch frequency is that high, but that is a discussion for the Linux kernel forum.
You can find a partially similar problem here:
https://serverfault.com/questions/14199/how-many-context-switches-is-normal-as-a-function-of-cpu-cores-or-other
If the amount of data being sent by the application is large and fast enough, it could be filling the kernel buffers, which leads to a delay on each send(). Since the sleep is outside the measured section, it would then be eating the time that would otherwise be spent blocking on the send() call.
One way to help check for this case would be to run with a relatively small number of iterations, and then with a moderate number of iterations. If the problem occurs with a small number of iterations (say 20) and small packet sizes (say <1k), then this is likely an incorrect diagnosis.
Keep in mind that your process and the kernel can easily overwhelm the network adapter and the wire-speed of the Ethernet (or other media type) if sending data in a tight loop like this.
I'm having trouble reading the screenshots. If Wireshark shows a constant rate of transmission on the wire, then it suggests this is the correct diagnosis. Of course, doing the math - dividing the wire speed by the packet size (+ header) - should give an idea of the maximum rate at which the packets can be sent.
As for the 700 microseconds leading to increased delay, that's harder to determine. I don't have any thoughts on that one.
I have some advice on how to create a more accurate performance measurement.
Use the RDTSC instruction (or, even better, the intrinsic __rdtsc() function). This involves reading a CPU counter without leaving ring 3 (no system call).
The gettime functions almost always involve a system call, which slows things down.
Your code is a little tricky as it involves 2 system calls (send/recv), but in general it is better to call sleep(0) before the first measurement to ensure that the very short measurement doesn't receive a context switch. Of course, the time measurement (and Sleep()) code should be disabled/enabled via macros in performance-sensitive functions.
Some operating systems can be tricked into raising your process priority by having your process release its execution time window (e.g. sleep(0)). On the next scheduling tick, the OS (not all of them) will raise the priority of your process, since it didn't finish running its execution time quota.
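A bare-bones sketch of what that might look like with the __rdtsc() intrinsic (x86 only; converting cycles to wall time, and the usual caveats about TSC behaviour across cores and frequency scaling, are left out):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long t0 = __rdtsc();
    /* ... code under test, e.g. the send()/recv() pair ... */
    unsigned long long t1 = __rdtsc();

    printf("elapsed: %llu cycles\n", t1 - t0);
    return 0;
}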

How to test the speed for Socket?

I wrote a program which can forward IP packets between 2 servers. How can I test the speed of the program? Thanks!
There are a number of communication metrics that may be of interest to your potential users.
Latency is the amount of time to send a message, usually quoted in microseconds for co-located devices and in milliseconds for all other scenarios. It is usually quoted as the "zero-byte latency", meaning the time required to transmit the metadata of a message. Lower is better.
Bandwidth is measured in bits per second. It is often quoted as "peak bandwidth" and can be obtained by sending a massive amount of data over the line. Higher is better.
CPU utilization is the percent of CPU time required to transmit a message. Network protocols that can offload a message's transmission have low utilization, which means that the communication can "overlap" some other computation in the user's application, which has the effect of hiding latency. Lower is better.
All of these are measured simply by a variation of the ping test, usually called the "ping-pong":
Node 1:
for n = 1 to MAXSIZE, step via n*=2
send message of size n bytes
receive a response of size n bytes
Node 2:
for n = 1 to MAXSIZE, step via n*=2
receive a message of size n bytes
send response of size n bytes
There's also a "ping-ping" test, in which both nodes write to each other at the same time. This requires non-blocking communication to set-up.
Just output n and the time required for each iteration. The first time is the zero-byte latency. The largest sustainable n/time is the bandwidth (convert to bits per second to be industry standard). You can also measure the CPU utilization required to run the larger iterations, but that's a tricky topic for a whole different question.
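A rough sketch of the Node 1 side of that loop in C, assuming sock is an already-connected TCP socket and leaving error handling out:

#include <stdio.h>
#include <time.h>
#include <sys/socket.h>

static void ping_pong(int sock, size_t max_size)
{
    static char buf[1 << 20];            /* payload buffer (contents don't matter) */
    struct timespec t0, t1;

    for (size_t n = 1; n <= max_size && n <= sizeof(buf); n *= 2) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        send(sock, buf, n, 0);
        recv(sock, buf, n, MSG_WAITALL); /* wait for the full echo */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 +
                      (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("%zu bytes: %.1f us, %.2f Mbit/s\n",
               n, usec, n * 8.0 / usec);  /* bits per microsecond == Mbit/s */
    }
}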
Take a look at iperf. You can find it at http://sourceforge.net/projects/iperf/. If you google around you will find tutorials for it. You can look at the source and might get some good ideas of how it works. I use it for routine testing and it is quite robust.
