I was reading the following about vector clocks:
http://book.mixu.net/distsys/time.html
I must be missing something. Due to network latency, isn't there always an "ack/nack"? If node B is sending an update to node C while node C's update to node B is still in transit, there must be some acknowledgment happening as well.
No, ack/nack belongs to a different layer than vector clocks. You use ack/nack to acknowledge that a packet was correctly transmitted/received at the transport layer (TCP, for instance).
Vector clocks live in the application layer. They let you order messages by the happened-before relation. This is relevant, for instance, in scenarios where you store and replicate/synchronize data.
Usually in distributed algorithms you send (and therefore also receive) messages asynchronously, to divide work or to propagate updated results, without expecting any direct answer. Such asynchronous communication reduces network overhead and increases both the throughput and the speed at which the nodes can communicate.
Imagine two processes (people) working on the same dataset (document) and updating each other.
Processes A and B both start from 0; A's vector clock is (0, 0) and so is B's - each clock's entries are (A, B)
A makes a change and sends an update to B
A: (1, 0); B: (0, 0)
B makes a change and sends an update to A
A: (1, 0); B: (0, 1)
A makes another change and sends an update to B
A: (2, 0); B: (0, 1)
A receives B's update and sees on the vector clock (0, 1)
A now knows that B hadn't applied any of A's updates to the document yet, and knows which data needs to be merged
A: (2, 1); B: (0, 1)
B receives both of A's messages, one after the other, and needs to perform a similar merge
A: (2, 1); B: (2, 1)
Now if A had to wait for a confirmation from B before sending another message, it would have wasted a lot of time. And in the end there would have been a conflict anyway, since they sent the updates concurrently: they would have either needed to merge the changes, or at least one process would have had to withdraw its changes.
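To make the mechanics concrete, here is a minimal vector clock sketch in Java (the class and method names are illustrative, not from any particular library):
(Java)
import java.util.Arrays;

// A minimal fixed-size vector clock like the (A, B) clocks above.
class VectorClock {
    private final int[] entries; // entries[0] counts A's events, entries[1] counts B's

    VectorClock(int size) { this.entries = new int[size]; }

    // Local event at process `id`: bump our own counter before sending an update.
    void tick(int id) { entries[id]++; }

    // On receiving an update, take the element-wise maximum of both clocks.
    void merge(VectorClock other) {
        for (int i = 0; i < entries.length; i++) {
            entries[i] = Math.max(entries[i], other.entries[i]);
        }
    }

    // "this happened before other" iff no entry is greater and at least one is strictly smaller.
    boolean happenedBefore(VectorClock other) {
        boolean strictlySmaller = false;
        for (int i = 0; i < entries.length; i++) {
            if (entries[i] > other.entries[i]) return false;
            if (entries[i] < other.entries[i]) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    @Override public String toString() { return Arrays.toString(entries); }
}

If neither clock happened before the other - e.g. (2, 0) versus (0, 1) above - the updates were concurrent and the data has to be merged, exactly as in the walkthrough.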
Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value); // emit each of the ~2 million b elements
}
The valuesList here contains ~2 million b elements.
We connect the two streams (A and B) using connect, key them by some key, and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation: even though there are a elements ready to be processed, they are not processed by processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value); // emit the element, as above
    if (totalSent % 10 == 0) {
        Thread.sleep(1000); // pause after every 10th element
    }
}
The processConnectedStreams function is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The question
We don't want to use that many resources, since stream B is activated very rarely, and for stream A's elements such high parallelism is overkill.
Is it possible to solve this without setting the parallelism higher than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.
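As a rough illustration of that idea, one option (a sketch against the Flink DataStream API; whether it helps depends on your actual topology) is to prevent Flink from chaining the connected flatMap into the same task as its neighbours, so it reads from network buffers:
(Java)
streamA.connect(streamB)
       .keyBy(AClass::someKey, BClass::someKey)
       .flatMap(processConnectedStreams)
       .disableChaining(); // force a task/network boundary around this operator
Note that the keyBy already repartitions both inputs over the network; disableChaining() additionally stops operator fusion on the flatMap itself.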
Problem statement:
Trying to evaluate Apache Flink for modelling advanced real-time, low-latency distributed analytics.
Use case abstract:
Provide complex analytics for instruments I1, I2, I3... etc., each having a product definition P1, P2, P3; configured with (dynamic) user parameters U1, U2, U3; and requiring streaming market data M1, M2, M3...
The instrument analytics functions (A1, A2) are complex in terms of computational cost; some of them could take 300-400 ms, but they can be computed in parallel.
From the above, the market data stream is clearly much faster (<1 ms) than the analytics functions, which therefore need to consume the latest consistent market data for their calculations.
The next challenge is the multiple dependent enrichment functions E1, E2, E3 (e.g. Risk/PnL) which combine streaming market data with instrument analytics results (e.g. price or yield).
The last challenge is consistency of the calculations: function A1 could be faster than A2, yet we need a consistent all-instrument result for a given market input.
Calculation graph dependency examples (scale this to hundreds of instruments and 10-15 market data sources); the dependency flow is:
- M1 + M2 + P1 => A2
- M1 + P1 => A1
- A1 + A2 => E2
- A1 => E1
- E1 + E2 => Result numbers
Questions:
What is the correct design/model for these calculation data streams? Currently I use ConnectedStreams for (P1 + M1); another approach could be an iterative model feeding the same instrument static data back to itself.
I am facing an issue using only the latest market data events in the calculations, as the analytics function (A1) is a lot slower than the market data (M1) stream.
Hence I need stale market data to be evicted for the next iteration, retaining only entries for which no newer value is available (like an LRU cache).
I need to synchronize/correlate the execution of functions of different time complexity, so that iteration 2 starts only once everything in iteration 1 has finished.
This is quite a broad question, and to answer it precisely one would need a few more details.
Below are a few thoughts that I hope will point you in a good direction and help you to approach your use case:
Connected streams by key (a.keyBy(...).connect(b.keyBy(...))) are the most powerful join- or union-like primitive. Using a CoProcessFunction on a connected stream should give you the flexibility to correlate or join values as needed. You can, for example, store the events from one stream in state while waiting for a matching event to arrive from the other stream.
Holding the latest data of one input is easily doable by just putting that value into the state of a CoFlatMapFunction or a CoProcessFunction. For each event from input 1, you store the event in the state. For each event from stream 2, you look into the state to find the latest event from stream 1.
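A minimal sketch of that pattern (assuming the Flink DataStream API; MarketData, Request, and Result are placeholder types standing in for your M and A events):
(Java)
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class LatestValueJoin extends RichCoFlatMapFunction<MarketData, Request, Result> {

    private transient ValueState<MarketData> latest; // per-key latest market data event

    @Override
    public void open(Configuration parameters) {
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-market-data", MarketData.class));
    }

    @Override
    public void flatMap1(MarketData md, Collector<Result> out) throws Exception {
        latest.update(md); // input 1: just remember the newest event
    }

    @Override
    public void flatMap2(Request req, Collector<Result> out) throws Exception {
        MarketData md = latest.value(); // input 2: join against the latest stored event
        if (md != null) {
            out.collect(new Result(req, md));
        }
    }
}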
To synchronize on time, you could actually look into using event time. Event time can also be "logical time", meaning just a version number, iteration number, or anything. You only need to make sure that the timestamp you assign and the watermarks you generate reflect that consistently.
If you then window by event time, you will get all data of a given version together, regardless of whether one operator is faster than another or the events arrive via paths with different latency. That is the beauty of real event-time processing :-)
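For illustration, a hedged sketch of that "logical time" idea (assuming a recent Flink DataStream API; Event, getVersion(), getKey(), and merge() are placeholder names):
(Java)
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Use the iteration/version number as the event timestamp.
DataStream<Event> versioned = events.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forMonotonousTimestamps()
                .withTimestampAssigner((event, previousTimestamp) -> event.getVersion()));

// Each 1 ms slice of "event time" now corresponds to exactly one version,
// so a window collects all data of that version regardless of operator speed.
versioned.keyBy(Event::getKey)
         .window(TumblingEventTimeWindows.of(Time.milliseconds(1)))
         .reduce((a, b) -> a.merge(b));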
I am trying to implement NeuroEvolution of Augmenting Topologies (NEAT) in C#. I am running into a problem with recurrent connections. I understand that, for a recurrent connection, the output is basically temporally displaced.
http://i.imgur.com/FQYjCLZ.png
In the linked image, I show a pretty simple neural network with 2 inputs, 3 hidden nodes, and one output. Without an activation function or transfer function, I think it would be evaluated as:
n3[t] = ((i1[t]*a + n6[t-1]*e)*d + i2[t]*b*c) * f
However, I am having a hard time figuring out how to identify the fact that the link e is a recurrent connection. The paper I read about NEAT showed how the minimal solutions of the XOR problem and the double pole (no velocities) problem both had recurrent connections.
It seems rather straightforward if you have a fixed topology, because you can analyze the topology yourself and identify which connections you need to time-delay.
How exactly would you identify these connections?
I had a similar problem when I started implementing this paper. I don't know what your network looks like at the moment, so I'll explain what I did.
My network starts out as input and output layers only. To create connections and neurons I implemented a kind of DNA (in my case an array of instructions like 'connect neuron no. 2 with neuron no. 5 and set the weight to 0.4'). Every neuron in my network has a "layerNumber" which tells me where that neuron sits in the network. The layerNumber is preset for every input and output neuron: for input neurons I used Double.MIN_VALUE and for output neurons Double.MAX_VALUE.
This is the basic setup. From now on just follow these rules when modifying the network:
Whenever you want to create a connection, make sure the 'from' neuron has a layerNumber < Double.MAX_VALUE.
Whenever you want to create a connection, make sure the 'to' neuron has a bigger layerNumber than the 'from' neuron.
Whenever a connection is split up into 2 connections with a neuron between them, set the new neuron's layerNumber to NeuronFrom.layerNumber*0.5 + NeuronTo.layerNumber*0.5.
This is important: you can't simply add the two numbers and divide by 2, because the intermediate sum would be on the order of Double.MAX_VALUE + something, which can overflow (for doubles the sum becomes Infinity, so the midpoint would be wrong).
If you follow all these rules you will only ever have forward connections, no recurrent ones. If you want recurrent connections, you can create them by simply swapping 'from' and 'to' when creating a new connection - see the sketch after the tips below.
Pro tricks:
Use only one ArrayList of neurons.
Make the DNA use IDs of neurons to find them, but create a 'Connection' class which has the Neuron objects as attributes.
When filtering your connections/neurons, use ArrayList.stream().filter().
When later propagating through the network, you can just sort your neurons by layerNumber, set the input values, and go through all neurons using a for() loop: calculate each neuron's output value and transfer it to every neuron that has a connection whose 'from' is the current neuron.
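Here is a minimal Java sketch of the layerNumber rules above (Neuron and Connection are illustrative stand-ins, not my actual classes):
(Java)
class Neuron {
    final int id;
    final double layerNumber; // inputs: Double.MIN_VALUE, outputs: Double.MAX_VALUE
    Neuron(int id, double layerNumber) { this.id = id; this.layerNumber = layerNumber; }
}

class Connection {
    final Neuron from, to;
    final double weight;
    final boolean recurrent;
    Connection(Neuron from, Neuron to, double weight) {
        this.from = from; this.to = to; this.weight = weight;
        // A forward connection must go from a smaller to a bigger layerNumber;
        // a swapped 'from'/'to' therefore marks the connection as recurrent.
        this.recurrent = from.layerNumber >= to.layerNumber;
    }

    // Splitting this connection: the new neuron sits halfway between the endpoints.
    // Averaging as a*0.5 + b*0.5 avoids overflowing the intermediate sum (a + b).
    Neuron split(int newId) {
        return new Neuron(newId, from.layerNumber * 0.5 + to.layerNumber * 0.5);
    }
}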
Hope it's not too complicated...
Okay, so instead of telling you to just not have recurrent connections, I'm actually going to tell you how to identify them.
The first thing you need to know is that recurrent connections are calculated after all other connections and neurons. So which connection is recurrent and which is not depends on your NN's order of calculation.
Also, the first time you put data into the system, we'll just assume that every connection carries zero, since otherwise some or all neurons couldn't be calculated.
Let's say we have this neural network:
[Image: the example neural network]
We divide this network into 3 layers (even though conceptually it has 4):
Input Layer [1, 2]
Hidden Layer [5, 6, 7]
Output Layer [3, 4]
First rule: All outputs from the output layer are recurrent connections.
Second rule: All outputs from the input layer may be calculated first.
We create two arrays: one containing the order of calculation of all neurons and connections, and one containing all the (potentially) recurrent connections.
Right now these arrays look something like this:
Order of
calculation: [1->5, 2->7 ]
Recurrent: [ ]
Now we begin by looking at the output layer. Can we calculate neuron 3? No, because 6 is missing. Can we calculate 6? No, because 5 is missing. And so on. It looks something like this:
3, 6, 5, 7
The problem is that we are now stuck in a loop, so we introduce a temporary array storing all the neuron IDs that we have already visited:
[3, 6, 5, 7]
Now we ask: can we calculate 7? No, because 6 is missing. But we have already visited 6...
[3, 6, 5, 7] <- 6
The third rule is: when you visit a neuron that has already been visited before, mark the connection you followed to reach it as a recurrent connection.
Now your arrays look like this:
Order of
calculation: [1->5, 2->7 ]
Recurrent: [6->7 ]
Now you finish the process, and at the end you join the order-of-calculation array with your recurrent array, so that the recurrent array comes after the other one.
It looks something like this:
[1->5, 2->7, 7, 7->4, 7->5, 5, 5->6, 6, 6->3, 3, 4, 6->7]
Let's assume we have [x->y, y], where x->y denotes the calculation x*weight(x->y), and y denotes the calculation Sum(of all inputs to y) - in this case Sum(x->y), or just x->y.
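A hedged Java sketch of this visited-set walk (the adjacency representation and names are illustrative; note that which connection gets marked recurrent depends on the traversal order, as described above):
(Java)
import java.util.*;

class RecurrentDetector {
    // incoming.get(n) lists all neurons m that have a connection m -> n.
    // Returns the connections that close a cycle, encoded as int[]{from, to}.
    static List<int[]> findRecurrent(Map<Integer, List<Integer>> incoming,
                                     List<Integer> outputs) {
        List<int[]> recurrent = new ArrayList<>();
        Set<Integer> onPath = new HashSet<>(); // neurons visited on the current walk
        Set<Integer> done = new HashSet<>();   // neurons fully resolved
        for (int out : outputs) {
            visit(out, incoming, onPath, done, recurrent);
        }
        return recurrent;
    }

    private static void visit(int n, Map<Integer, List<Integer>> incoming,
                              Set<Integer> onPath, Set<Integer> done,
                              List<int[]> recurrent) {
        if (done.contains(n)) return;
        onPath.add(n);
        for (int m : incoming.getOrDefault(n, Collections.emptyList())) {
            if (onPath.contains(m)) {
                recurrent.add(new int[]{m, n}); // third rule: already visited -> recurrent
            } else {
                visit(m, incoming, onPath, done, recurrent);
            }
        }
        onPath.remove(n);
        done.add(n);
    }
}
For the example network above (connections 1->5, 2->7, 7->4, 7->5, 5->6, 6->3, 6->7 and outputs [3, 4]), this returns exactly [6->7].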
There are still some problems to solve here. For example: what if the only input of a neuron is a recurrent connection? But I guess you'll be able to solve that on your own...
I want one primary collection of items of a single type, to which modifications are made over time. Periodically, several slave collections will synchronize with the primary collection. The primary collection should send a delta of items to the slave collections.
Primary Collection: A, C, D
Slave Collection 1: A, C (add D)
Slave Collection 2: A, B (add C, D; remove B)
The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.
I don't want to push more data than necessary since the collection may become quite large.
What kind of data structures and strategies would be ideal for this?
For that I use differential execution.
(BTW, the word "slave" is uncomfortable for some people, with reason.)
For each remote site, there is a sequential file at the primary site representing what exists on the remote site.
There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist.
Those differences produce deltas, which are transmitted to the remote site.
At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.
The advantage of this is that it does not depend on detecting change events in the primary collection. Often those change events are unreliable, self-cancelling, or made irrelevant by other changes, so this cuts way down on needless transmissions to the remote site.
In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta.
Here are a couple such algorithms:
If the collections can be sorted (like your A,B,C example), just run a merge loop:
// X: the current primary collection (sorted); Y: the saved copy of the remote collection.
// ix and iy start at 0; nx and ny are the collection sizes.
while(ix<nx && iy<ny){
    if (X[ix] < Y[iy]){
        // X[ix] was inserted in X
        ix++;
    } else if (Y[iy] < X[ix]){
        // Y[iy] was deleted from X
        iy++;
    } else {
        // the two elements are equal; skip them both
        ix++; iy++;
    }
}
while(ix<nx){
    // X[ix] was inserted in X
    ix++;
}
while(iy<ny){
    // Y[iy] was deleted from X
    iy++;
}
If the collections cannot be sorted (note the relationship to Levenshtein distance):
Until we have read through both collections X and Y,
See if the current items are equal
else see if a single item was inserted in X
else see if a single item was deleted from X
else see if 2 items were inserted in X
else see if a single item was replaced in X
else see if 2 items were deleted from X
else see if 3 items were inserted in X
else see if 2 items in X replaced 1 item in Y
else see if 1 item in X replaced 2 items in Y
else see if 3 items were deleted from X
etc. etc. up to some limit
Performance is generally not an issue, because the procedure does not have to be run at high frequency.
There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.
If one doesn't push all the data, some sort of log is required, which, instead of using pipe bandwidth, uses main memory. The parameter for finding a good balance between CPU and memory usage would be the 'push' frequency.
From your question I assume you have more than one slave process. In that case, a shared-memory or CMA (Linux) approach with double buffering in the master process should outperform multiple pipes by far, as it doesn't even require the multithreaded pushing that would otherwise be used to optimize overall pipe throughput during synchronization.
The slave processes could be notified via a global synchronization barrier to read from masterCollectionA without copying, while the master modifies masterCollectionB (which is initialized with a copy of masterCollectionA), and vice versa. Access to a collection should be interlocked between the slaves and the master. The slaves could copy a collection (take a snapshot) if they would otherwise block it past the master's next update attempt, thus allowing the master to continue. Modifications in the slave processes could be implemented with a copy-on-write strategy for single elements. This cooperative approach is rather simple to implement, and as long as the slave processes don't copy whole snapshots every time, the overall memory consumption stays low.
I am making a program in C with MPI that will use 2-4 processes. In the elimination step I send each row below the current row to a different process using cyclic mapping.
E.g., for a 5 x 6 matrix with active row 0 and 4 processes, I send row 1 to process 1, row 2 to process 2, row 3 to process 3, and row 4 to process 1 again. In these processes I want to do some computation and return values back to process 0, which will add them into its original matrix. My question is: which Send and Recv calls should I use?
Here is some schematic code:
if (master) {
    for (...) {
        // send values to the other processes
    }
    for (...) {
        // receive values from the other processes
    }
} else { // other, non-master processes
    // receive value from master
    // do some computing
    // send value back to master
}
If I use only a simple blocking send, I don't know what will happen in process 1, because it gets rows 1 and 4, and before I call Recv in the master, the value in process 1 for row 1 will be overwritten by row 4, so I will miss one set of data. Am I right? My question is therefore: what kind of MPI_Send should I use?
You might be able to make use of nonblocking communication if, say, proc 1 can start doing something with row 1 while waiting for row 4. But blocking should be fine to start with, too.
There is a lot of synchronization built into the algorithm: everyone has to work based on the current row, so the receiving processes need to know how much work to expect for each iteration of the procedure. That's easy if they know the total number of rows and which iteration they're currently working on. If they're expecting two rows, they could do two blocking Recvs, or launch 2 nonblocking Recvs, wait on one, and start processing the first row right away. But it's probably easiest to get blocking working first.
You may find it advantageous, even at the start, to have the master process do MPI_Isends so that all the sends can be "launched" simultaneously; an MPI_Waitall can then complete them in whatever order they finish.
But better than this one-to-many communication would probably be an MPI_Scatterv, which one could expect to proceed more efficiently.
Something that has been raised in many answers and comments to this series of questions, but which I don't think you've ever actually addressed: I really hope you're just working through this project for educational purposes. It would be completely, totally crazy to actually implement your own parallel linear algebra solvers when tuned and tested tools like ScaLAPACK, PETSc, and PLAPACK are out there, free for download.