Totally Ordered Multicast with Lamport Clocks without FIFO - distributed

In lecture I learned that totally ordered multicast with Lamport clocks can be achieved under the assumptions that the network is reliable and that multicast delivery is FIFO.
This requires a total of N messages (the initial broadcast) + N^2 messages (the ACKs).
However, one of the questions I had was how to optimize the above algorithm so that it preserves the same total ordering but incurs significantly fewer total messages. It should also still work over a reliable network, but without FIFO.
Can we actually achieve such an optimized algorithm (with fewer messages) without a FIFO guarantee in the network?
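For reference, a minimal sketch of the baseline delivery rule I am referring to (hold-back queue ordered by Lamport timestamp, deliver when the head has been ACKed by everyone; all names are illustrative):

    /* Baseline (lecture) rule: every process multicasts its message with a
     * Lamport timestamp, every receiver multicasts an ACK, and a message is
     * delivered once it sits at the head of the timestamp-ordered hold-back
     * queue and has been ACKed by all N processes. */
    #include <stdbool.h>
    #include <stddef.h>

    #define N 8  /* number of processes, assumed fixed and known */

    struct msg {
        unsigned long ts;      /* Lamport timestamp of the sender */
        int           sender;  /* sender id, used to break timestamp ties */
        int           acks;    /* how many distinct processes ACKed it */
    };

    /* hold-back queue, kept sorted by (ts, sender) */
    static struct msg queue[1024];
    static size_t     queue_len = 0;

    /* With reliable FIFO channels, once the head has been ACKed by all N
     * processes no message with a smaller timestamp can still arrive,
     * so it is safe to deliver. */
    static bool can_deliver_head(void)
    {
        return queue_len > 0 && queue[0].acks == N;
    }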

Related

Monitoring Thread performance of server

I have developed a C server using gcc and pthreads that receives UDP packets and, depending on the configuration, either drops them or forwards them to specific targets. In some cases these packets are untouched and just redirected, in some cases headers in the packet are modified, and in other cases another module of the server modifies every byte of the packet.
To configure this server, there is a GUI written in Java that connects to the C Server using TCP (to exchange configuration commands). There can be multiple connected GUIs at the same time.
In order to measure the utilization of the server I have written a kind of module that starts two separate threads (#2 & #3). The main thread (#1), which does the whole forwarding work, essentially works like the following:
    struct monitoring_struct data; // contains 2 * uint64_t for start and end time among other fields
    for (;;) {
        recvfrom();
        data.start = current_time();
        modifyPacket();
        sendPacket(); // sometimes to multiple destinations
        data.end = current_time();
        writeDataToPipe();
    }
The current_time function:
    // gives a timestamp with microsecond precision
    uint64_t current_time(void)
    {
        struct timespec spec;
        clock_gettime(CLOCK_REALTIME, &spec);
        uint64_t ts = (uint64_t) ((((double) spec.tv_sec) * 1.0e6) +
                                  (((double) spec.tv_nsec) / 1.0e3));
        return ts;
    }
As indicated in the main thread, the data struct is written into a pipe, from which thread #2 reads. Every time there is data to be read from the pipe, thread #2 uses a given aggregation function that stores the data in another place in memory. Thread #3 is a loop that always sleeps for ~1 sec, then sends out the aggregated values (median, avg, min, max, lower quartile, upper quartile, ...) and then resets the aggregated data. Threads #2 and #3 are synchronized by mutexes.
The aggregated data is sent out via UDP to listeners (there can be more than one); the GUI listens to it (if the monitoring window is open) and converts the numbers into diagrams, graphs and "pressure" indicators.
I came up with this because, in my mind, it is the solution that interferes least with thread #1 (assuming it is run on a multicore system, which it always is, and exclusively apart from the OS and maybe SSH).
As performance is critical for my server (version "1.0" with a simpler configuration was able to manage the maximum number of streams possible over gigabit Ethernet), I would like to ask whether my solution may not be as good as I think it is at keeping the performance hit on thread #1 to a minimum, and if you think there would be better designs for that. At least I am unable to think of another solution that does not use locks on the data itself (avoiding the pipe, but potentially blocking thread #1) or a shared list implementation using an rwlock, with possible reader starvation.
There are scenarios where packets are larger, but for performance measuring we currently use the mode where 1 stream sends exactly 1000 packets per second. We currently want to ensure that version 2.0 can work with at least 12 streams (hence 12000 packets per second); previously the server was able to manage 84 streams.
In the future I would like to add other milestone timestamps to thread #1, e.g. inside modifyPacket() (there are multiple steps) and before sendPacket().
I have tried tinkering with the current_time() function, mostly trying to remove it and save time by just storing the raw value from clock_gettime(), but in my simple test program the current_time() function always beat the plain clock_gettime() version.
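For reference, the kind of variation I was experimenting with looked roughly like this (field names are illustrative, not my actual struct):

    #include <stdint.h>
    #include <time.h>

    struct mon_sample {                  // illustrative stand-in for my monitoring struct
        struct timespec start;           // filled in thread #1 straight from clock_gettime()
        struct timespec end;
        /* ... other fields ... */
    };

    // conversion deferred to thread #2, using integer math only
    static inline uint64_t timespec_to_us(const struct timespec *ts)
    {
        return (uint64_t) ts->tv_sec * 1000000u + (uint64_t) ts->tv_nsec / 1000u;
    }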
Thanks in advance for any input.
if you think there would be better designs for that?
The short answer is to use Data Plane Development Kit (DPDK) with its design patterns and libraries. It might be quite a learning curve, but in terms of performance it is the best solution at the moment. It is free and open source (BSD license).
A bit more detailed answer:
the data struct is written into a pipe
Since threads #1 and #2 are threads of the same process, it would be much faster to pass data using shared memory, not pipes, just like you did between threads #2 and #3.
thread #2 uses a given aggregation function that stores the data in another place in memory
Those two threads seem unnecessary. Couldn't thread #2 read the data passed by thread #1, aggregate it and send it out itself?
I am unable to think of another solution that does not use locks on the data itself
Have a look at lockless queues, which are called "rings" in DPDK. The idea is to have a shared circular buffer between the threads and use lockless algorithms to enqueue/dequeue to/from the buffer.
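For example, a minimal single-producer/single-consumer ring between thread #1 and thread #2 could look roughly like this (a generic sketch of the idea, not DPDK's rte_ring API; the struct stands in for the one from your question):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct monitoring_struct { uint64_t start, end; /* ... */ };  // as in the question

    #define RING_SIZE 1024               // power of two
    #define RING_MASK (RING_SIZE - 1)

    struct ring {
        struct monitoring_struct buf[RING_SIZE];
        _Atomic unsigned head;           // written only by the producer (thread #1)
        _Atomic unsigned tail;           // written only by the consumer (thread #2)
    };

    static bool ring_enqueue(struct ring *r, const struct monitoring_struct *d)
    {
        unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                // ring full, drop the sample
        r->buf[head & RING_MASK] = *d;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static bool ring_dequeue(struct ring *r, struct monitoring_struct *out)
    {
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return false;                // ring empty
        *out = r->buf[tail & RING_MASK];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

Thread #1 would call ring_enqueue() instead of writeDataToPipe(), and thread #2 would poll ring_dequeue() and aggregate; no locks are needed because each index is written by exactly one thread.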
We currently want to ensure that version 2.0 can work with at least 12 streams (hence 12000 packets per second); previously the server was able to manage 84 streams.
Measure the performance and find the bottlenecks (it seems you are still not 100% sure what the bottleneck in the code is).
Just for reference, Intel publishes performance reports for DPDK. The reference numbers for L3 forwarding (i.e. routing) are up to 30 million packets per second.
Sure, you might have a less powerful processor and NIC, but a few million packets per second are quite easily reachable using the right techniques.

How to limit read speed from a tcp socket [duplicate]

I'm writing a client-server app using BSD sockets. It needs to run in the background, continuously transferring data, but cannot hog the bandwidth of the network interface from normal use. Depending on the speed of the interface, I need to throttle this connection to a certain max transfer rate.
What is the best way to achieve this, programmatically?
The problem with sleeping a constant amount of 1 second after each transfer is that you will have choppy network performance.
Let BandwidthMaxThreshold be the desired bandwidth threshold.
Let TransferRate be the current transfer rate of the connection.
Then...
If you detect that TransferRate > BandwidthMaxThreshold, then do SleepTime = 1 + SleepTime * 1.02 (increase the sleep time by 2%, plus 1 so it grows from zero)
Before or after each network operation do a
Sleep(SleepTime)
If you detect that your TransferRate is a lot lower than your BandwidthMaxThreshold, you can decrease your SleepTime. Alternatively you could just decay/decrease your SleepTime over time; eventually your SleepTime will reach 0 again.
Instead of an increase of 2% you could also increase SleepTime by an amount proportional to the difference TransferRate - BandwidthMaxThreshold.
This solution is good because you will have no sleeps at all if the user's transfer rate is already below the threshold.
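A rough sketch of this scheme (how TransferRate is measured and the millisecond units are my own assumptions):

    #include <time.h>

    static double sleep_time_ms = 0.0;

    // call this before or after each send/recv
    static void throttle(double transfer_rate, double bandwidth_max_threshold)
    {
        if (transfer_rate > bandwidth_max_threshold)
            sleep_time_ms = 1.0 + sleep_time_ms * 1.02;  // back off by ~2% (+1 ms so it grows from zero)
        else
            sleep_time_ms *= 0.98;                       // decay back toward zero

        if (sleep_time_ms >= 1.0) {
            long ms = (long) sleep_time_ms;
            struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
            nanosleep(&ts, NULL);
        }
    }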
The best way would be to use a token bucket.
Transmit only when you have enough tokens to fill a packet (1460 bytes would be a good amount), or, if you are on the receive side, read from the socket only when you have enough tokens. A bit of simple math will tell you how long you have to wait before you have enough tokens, so you can sleep for that amount of time (be careful to calculate how many tokens you actually gained from how long you actually slept, since most operating systems can sleep your process for longer than you asked).
To control the size of the bursts, limit the maximum amount of tokens you can have; a good amount could be one second worth of tokens.
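A rough sketch of such a token bucket (the names and the bytes-per-second units are my own choices):

    #include <time.h>

    struct token_bucket {
        double tokens;        // current tokens, in bytes
        double rate_bps;      // refill rate, bytes per second
        double burst;         // cap, e.g. one second's worth of tokens
        struct timespec last; // time of the last refill
    };

    static void tb_refill(struct token_bucket *tb)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        double dt = (now.tv_sec - tb->last.tv_sec) +
                    (now.tv_nsec - tb->last.tv_nsec) / 1e9;
        tb->last = now;
        tb->tokens += dt * tb->rate_bps;
        if (tb->tokens > tb->burst)
            tb->tokens = tb->burst;     // limit burst size
    }

    // Returns how long to sleep (seconds) before `bytes` may be sent or read;
    // 0 means the tokens were available and have been spent.
    static double tb_wait_time(struct token_bucket *tb, double bytes)
    {
        tb_refill(tb);
        if (tb->tokens >= bytes) {
            tb->tokens -= bytes;
            return 0.0;
        }
        return (bytes - tb->tokens) / tb->rate_bps;
    }

After sleeping, call tb_wait_time() again instead of assuming the full wait elapsed; the refill then accounts for however long the OS actually slept you.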
I've had good luck with trickle. It's cool because it can throttle arbitrary user-space applications without modification. It works by preloading its own send/recv wrapper functions which do the bandwidth calculation for you.
The biggest drawback I found was that it's hard to coordinate multiple applications that you want to share finite bandwidth. "trickled" helps, but I found it complicated.
Update in 2017: it looks like trickle moved to https://github.com/mariusae/trickle

Algorithm - Handling Jitter and Drift with External Codec/Modem

I am writing a small module in C to handle jitter and drift for a full-duplex audio system. It acts as a very primitive voice chat module, which connects to an external modem that uses a separate clock, independent from my master system clock (ie: it is not slaved off of the system master clock).
The source is based off of an existing example available online here: http://svn.xiph.org/trunk/speex/libspeex/jitter.c
I have 4 audio streams:
Network uplink (my voice, after processing, going to the far side speaker)
Network downlink (far side's voice, before processing, coming to me)
Speaker output (the far side's voice, after processing, to the local speakers)
Mic input (my voice, before processing, coming from the local microphone)
I have two separate threads of execution. One handles the local devices and buffer (ie: playing processed audio to the speakers, and capturing data from the microphone and passing it off to the DSP processing library to remove background noise, echo, etc). The other thread handles pulling the network downlink signal and passing it off to the processing library, and taking the processed data from the library and pushing it via the uplink connection.
The two threads use mutexes and a set of shared circular/ring buffers. I am looking for a way to implement a sure-fire (safe and reliable) jitter and drift correction mechanism. By jitter, I am referring to a clock having variable duty cycle, but the same frequency as an ideal clock.
The other potential issue I would need to correct is drift, which would assume both clocks use an ideal 50% duty cycle, but their base frequency is off by ±5%, for example.
Finally, these two issues can occur simultaneously. What would be the ideal approach to this? My current approach is to use a type of jitter buffer. These are just data buffers which keep a moving average of their "fill" level. If a thread tries to read from the buffer and not enough data is available (a buffer underflow), I just generate data for it on the fly by either providing a spare zeroed-out packet or by duplicating a packet (i.e. packet loss concealment). If data is coming in too quickly, I discard an entire packet of data and keep going. This handles the jitter portion.
The second half of the problem is drift correction. This is where the average fill-level metric comes in useful: for all buffers, I can calculate the relative growth/reduction of their levels and add or subtract a small number of samples every so often so that all buffer levels hover around a common average "fill" level.
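A sketch of the kind of bookkeeping I mean (names and constants are illustrative, not my actual module):

    struct jitter_buf {
        int    fill;        // samples currently buffered
        double avg_fill;    // exponential moving average of the fill level
        int    target_fill; // where we want avg_fill to hover
    };

    // Called once per packet period: update the moving average and decide
    // whether a small drift correction is needed.
    static int drift_adjust(struct jitter_buf *jb)
    {
        const double alpha = 0.01;  // smoothing factor for the moving average
        jb->avg_fill = (1.0 - alpha) * jb->avg_fill + alpha * jb->fill;

        if (jb->avg_fill > jb->target_fill + 1)
            return -1;  // buffer creeping up: drop one sample
        if (jb->avg_fill < jb->target_fill - 1)
            return +1;  // buffer draining: insert (pad) one sample
        return 0;       // close enough, do nothing
    }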
Does this approach make sense, and are there any better or "industry standard" approaches to handling this problem?
Thank you.
References
Word Clock – What’s the difference between jitter and frequency drift?, Accessed 2014-09-13, <http://www.apogeedigital.com/knowledgebase/fundamentals-of-digital-audio/word-clock-whats-the-difference-between-jitter-and-frequency-stability/>
Jitter.c, Accessed 2014-09-13, <http://svn.xiph.org/trunk/speex/libspeex/jitter.c>
I faced a similar, although admittedly simpler, problem. I won't be able to fully answer your question, but I hope sharing my solutions to some practical problems I ran into will benefit you anyway.
Last year I was working on a system which had to simultaneously record from and render to multiple audio devices, each potentially ticking off a different clock. The most obvious example is a duplex stream on 2 devices, but it also handled setups with multiple inputs/outputs only. All in all it was a bit simpler than your situation (single threaded and no network I/O). In the end I don't believe dealing with more than 2 devices is harder than dealing with 2; any system with multiple clocks is going to have to deal with the same problems.
Some stuff I've learned:
Pick one stream and designate its clock as "the truth" (i.e., sync all other streams to a common master clock). If you don't do this you won't have a well-defined notion of "current sample position", and without it there's nothing to sync to. This also has the benefit that at least one stream in the system will always be clean (no dropping/padding samples).
Your approach of using an additional buffer to handle jitter is correct. Without it you'd be constantly dropping/padding even on streams with the same nominal sample rate.
Consider whether or not you'd want to introduce such a jitter buffer for the "master" stream also. Doing so means introducing artificial latency in the master stream, not doing so means the rest of your streams will lag behind.
I'm not sure whether it's a good idea to drop entire packets. Why not try to use up as many of the samples as possible? Especially with large packet sizes this is far less noticeable.
To elaborate on the above, I got badly bitten by the following case: assume s1 (master) producing 48000 frames every second and s2 producing 96000 every 2 seconds. Round 1: read 48000 from s1, 0 from s2. Round 2: read 48000 from s1, 96000 from s2 -> overflow. Discard entire packet. Round 3: read 48000 from s1, 0 from s2. Etc. Obviously this is a contrived example, but I ran into cases where on average I dropped 50% of the secondary stream's data using this scheme. Introduction of the jitter buffer does help, but it didn't completely fix this problem. Note that this is not strictly related to clock jitter/skew; it's just that some drivers like to update their padding values periodically and will not accurately report to you what is really in the hardware buffer.
Another variation on this problem happens when you really do get clock jitter but the API of your choice doesn't let you control the packet size (i.e., it doesn't let you request fewer frames than are actually available). Assume s1 (master) recording at 1000 Hz and s2 alternating each second between 1000 and 1001 Hz. Round 1: read 1000 frames from both. Round 2: read 1000 frames from s1 and 1001 from s2 -> overflow. Etc.; on average you'll dump around 50% of the frames on s2. Note that this is not so much a problem if your API lets you say "give me 1000 samples even though I know you've got more". By doing so, though, you'll eventually overflow the hardware input buffer.
To have the most control over when to drop/pad, I found it easiest to always keep input buffers empty and output buffers full. This way all dropping/padding takes place in the jitter buffer and you'll at least know and control what's happening.
If possible, try to separate your program logic: the hard part is finding out where to pad/drop samples. Once you've got that in place it's easy to try different variations of pad/drop, sample-and-hold, interpolation, etc.
All in all I'd say your solution looks very reasonable, although I'm not sure about the "drop entire packet" thing, and I'd definitely pick one stream as the master to sync against. For completeness, here's the solution I eventually came up with:
1: Assume a jitter buffer of size J on each stream.
2: Wait for a packet of size M to become available on the master stream (M is typically derived from the stream latency). We're going to deliver M frames of input/output to the app. I didn't implement an additional buffer on the master stream.
3: For all input streams: let H be the number of recorded frames in the hardware buffer, B be the number of recorded frames currently in the jitter buffer, and A being the number of frames available to the application: A equals H + B.
3a: If A < M, we have input underflow. Offer A recorded frames + (M - A) padding frames to the app. Since the device is likely slow, fill 1/2 of the jitter buffer with silence.
3b: If A == M, offer A frames to the app. The jitter buffer is now empty.
3c: If A > M but (A - M) <= J, offer M recorded frames to the app. A - M frames stay in the jitter buffer.
3d: If A > M and (A - M) > J, we have input overflow. Offer M recorded frames to the app, of the remaining frames put J/2 back in the jitter buffer, we use up M + J/2 frames and we drop A - (M + J/2) frames as overflow. Don't try to keep the jitter buffer full because the device is likely fast and we don't want to overflow again on the next round.
4: Sort of the inverse of 3: for outputs, fast devices will underflow, slow devices will overflow.
A, H and B are the same thing, but this time they don't represent available frames but available padding (e.g., how many frames can I offer to the app to write into).
Try to keep hardware buffers full at all costs.
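For illustration, a condensed sketch of the input-side decision in steps 3a-3d (all names are mine; J, M, H and B as defined above):

    enum action { UNDERFLOW_PAD, EXACT, KEEP_IN_JITTER, OVERFLOW_DROP };

    // pad  = padding frames offered to the app (3a)
    // keep = frames placed/left in the jitter buffer
    // drop = frames discarded as overflow (3d)
    static enum action input_decision(int H, int B, int M, int J,
                                      int *pad, int *keep, int *drop)
    {
        int A = H + B;                 // frames available to the application
        *pad = *keep = *drop = 0;

        if (A < M) {                   // 3a: input underflow
            *pad  = M - A;             // offer A real frames + (M - A) padding
            *keep = J / 2;             // fill half the jitter buffer with silence
            return UNDERFLOW_PAD;
        }
        if (A == M)                    // 3b: exact fit, jitter buffer ends up empty
            return EXACT;
        if (A - M <= J) {              // 3c: surplus fits in the jitter buffer
            *keep = A - M;
            return KEEP_IN_JITTER;
        }
        *keep = J / 2;                 // 3d: overflow, keep only half a jitter buffer
        *drop = A - (M + J / 2);       // and drop the rest
        return OVERFLOW_DROP;
    }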
This scheme worked out quite well for me, although there are a few things to consider:
It involves a lot of bookkeeping. Make sure that for input buffers, data always flows from hardware -> jitter buffer -> application, and for outputs always from app -> jitter buffer -> hardware. It's very easy to make the mistake of thinking you can "skip" frames in the jitter buffer if there are enough samples available from the hardware directly to the app. This will essentially mess up the chronological order of frames in an audio stream.
This scheme introduces variable latency on secondary streams because I try to postpone the moment of padding/dropping as long as possible. This may or may not be a problem. I found that in practice postponing these operations gives audibly better results, probably because many "minor" glitches of only a few samples are more annoying than the occasional larger hiccup.
Also, PortAudio (an open source audio project) has implemented a similar scheme; see http://www.portaudio.com/docs/proposals/001-UnderflowOverflowHandling.html. It may be worthwhile to browse through the mailing list and see what problems/solutions came up there.
Note that everything I've said so far is only about interaction with the audio hardware; I've no idea whether this will work equally well with the network streams, but I don't see any obvious reason why not. Just pick 1 audio stream as the master, sync the other one to it, and do the same for the network streams. This way you'll end up with two more-or-less independent systems connected only by the ring buffer, each with an internally consistent clock, each running on its own thread. If you're aiming for low audio latency, you'll also want to drop the mutexes and opt for a lock-free FIFO of some sort.
I am curious to see if this is possible. I'll throw in my two bits though.
I am a novice programmer, but studied audio engineering/interactive audio.
My first assumption is that this is not possible. At least not on a sample-to-sample basis. Especially not for complex audio data and waveforms such as human speech. The program could have no expectation of what the waveform "should" look like.
This is why there are high-end audio interfaces with temperature controlled internal clocks.
On the other hand, maybe there is a library that can detect the symptoms of jitter, somehow...
In which case I would be very curious to hear about it.
As far as drift correction, maybe I don't understand something on the programming front, but shouldn't you be pulling audio at a specific sample rate? I believe sample rate/drift is handled at the hardware level.
I really hope this helps. You might have to steer me closer to home.

Unicast source address filtering in datagram socket

I need to perform data filtering based on the source unicast IPv4 address of datagrams arriving to a Linux UDP socket.
Of course, it is always possible to manually perform the filtering based on the information provided by recvfrom, but I am wondering if there could be another more intelligent/efficient approach (if possible, not using libpcap).
Any ideas?
If it's a single source you need to allow, then just use connect(2) and the kernel will do the filtering for you. As a bonus, connected UDP sockets are more efficient. This, of course, does not work for more than one source.
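A minimal illustration (the addresses and ports are placeholders):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int make_filtered_udp_socket(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in local = { 0 };
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(12345);                    // local port we receive on
        if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0) {
            close(fd);
            return -1;
        }

        struct sockaddr_in peer = { 0 };
        peer.sin_family = AF_INET;
        inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr); // the only allowed source
        peer.sin_port = htons(5000);
        if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) < 0) {
            close(fd);
            return -1;
        }

        return fd;  // recv(fd, ...) now only sees datagrams from 192.0.2.10:5000
    }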
As already stated, NetFilter (the Linux firewall) can help you here.
You could also use the UDP options of xinetd and tcpd to perform filtering.
What proportion of datagrams are you expecting to discard? If it is very high, then you may want to review your application design (for example, to make the senders not send so many datagrams which are to be discarded). If it is not very high, then you don't really care about how much effort you spend discarding them.
Suppose discarding a packet takes the same amount of (runtime) effort as processing it normally; if you discard 1% of packets, you will only be spending 1% of time discarding. However, realistically, discarding is likely to be much easier than processing messages.

How to test the speed for Socket?

I wrote a program which can forward IP packets between 2 servers, so how do I test the speed of the program? Thanks!
There are a number of communication metrics that may be of interest to your potential users.
Latency is the amount of time to send a message, usually quoted in microseconds for co-located devices and in milliseconds for all other scenarios. It is usually quoted as the "zero-byte latency", meaning the time required to transmit the metadata of a message. Lower is better.
Bandwidth is measured in bits per second. It is often quoted as "peak bandwidth" and can be obtained by sending a massive amount of data over the line. Higher is better.
CPU utilization is the percent of CPU time required to transmit a message. Network protocols that can offload a message's transmission have low utilization, which means that the communication can "overlap" some other computation in the user's application, which has the effect of hiding latency. Lower is better.
All of these are measured simply by a variation of the ping test, usually called the "ping-pong":
    Node 1:
        for n = 1 to MAXSIZE, step via n *= 2
            send message of size n bytes
            receive a response of size n bytes

    Node 2:
        for n = 1 to MAXSIZE, step via n *= 2
            receive a message of size n bytes
            send response of size n bytes
There's also a "ping-ping" test, in which both nodes write to each other at the same time. This requires non-blocking communication to set up.
Just output n and the time required for each iteration. The first time is the zero-byte latency. The largest sustainable n/time is the bandwidth (convert to bits per second to be industry standard). You can also measure the CPU utilization required to run the larger iterations, but that's a tricky topic for a whole different question.
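For illustration, a rough sketch of the Node 1 side in C over an already connected TCP socket (the helper names are mine, not from any particular tool):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>

    #define MAXSIZE (1 << 20)

    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    // send or receive exactly len bytes (loop because TCP may split the transfer)
    static int xfer(int fd, char *buf, size_t len, int sending)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = sending ? send(fd, buf + done, len - done, 0)
                                : recv(fd, buf + done, len - done, 0);
            if (n <= 0)
                return -1;
            done += (size_t) n;
        }
        return 0;
    }

    void pingpong(int fd)
    {
        static char buf[MAXSIZE];
        for (size_t n = 1; n <= MAXSIZE; n *= 2) {
            double t0 = now_seconds();
            if (xfer(fd, buf, n, 1) || xfer(fd, buf, n, 0))
                break;
            double rtt = now_seconds() - t0;
            // small n approximates the latency; for large n, n / (rtt / 2)
            // approximates the one-way bandwidth in bytes per second
            printf("%zu bytes: %.6f s round trip\n", n, rtt);
        }
    }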
Take a look at iperf. You can find it at http://sourceforge.net/projects/iperf/. If you google around you will find tutorials for it. You can look at the source and might get some good ideas of how it does things. I use it for routine testing and it is quite robust.
