Distributed Mutual Exclusion: Coterie Formation

I have been studying distributed mutual exclusion algorithms based on the concept of Quorums.
Quoting:
A Coterie C is defined as a set of sets, where each set g ∈ C is called a quorum.
The following properties hold for quorums in a coterie:
1) Intersection property: For every quorum g, h ∈ C, g ∩ h ≠ ∅.
For example, sets {1,2,3}, {2,5,7} and {5,7,9} cannot be quorums in a
coterie because the first and third sets do not have a common element.
2) Minimality property: There should be no quorums g, h in coterie C such
that g ⊇ h. For example, sets {1,2,3} and {1,3} cannot be quorums in a
coterie because the first set is a superset of the second.
I would like to know how, given a set of nodes in a distributed system, such coteries (sets of quorums) are formed from those nodes.
What are the algorithms or techniques to do this?
UPDATE:
To put the problem in other words -
"Given 'N' nodes, what is the best way to form 'K' quorums such that any two of them have 'J' number of nodes in common?"

A simple approach to reading and writing is to read from every node in a quorum and to write to every node in a quorum. Because any two quorums intersect, every reader is guaranteed to contact at least one node that holds the latest written item.
Since your title is about mutual exclusion: a peer in the system can ask every node in a quorum for a lock on a resource. Because of the first rule (intersection), no other peer can obtain the lock from a whole quorum at the same time.
As far as I know, in practice you contact random nodes and use any n/2 + 1 of them as a quorum. As you can see, though, you can also define more sophisticated arrangements that allow smaller quorums, which in turn improves performance.
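The n/2 + 1 rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `majority_quorums` and the five-node example are made up here, and enumerating every majority subset is practical only for small n.

```python
from itertools import combinations


def majority_quorums(nodes):
    """Return every subset of size n//2 + 1 as a quorum (hypothetical helper)."""
    size = len(nodes) // 2 + 1
    return [frozenset(group) for group in combinations(nodes, size)]


nodes = [1, 2, 3, 4, 5]
quorums = majority_quorums(nodes)

# Two subsets of size n//2 + 1 together hold more than n members,
# so they must share at least one node.
all_intersect = all(g & h for g, h in combinations(quorums, 2))
```

With five nodes each quorum has three members, so any two of them overlap in at least one node, which is exactly the intersection property.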
Update:
Examples for such quorums with 9 servers could be the following:
2 quorums: servers 1-5 are one quorum and 5-9 would be another (simple majority)
3 quorums: servers 1,2,3,4; 4,5,6,7; and 7,8,9,1 could be 3 different quorums
more quorums: servers 1,2,3; 3,4,5; 5,6,1; 6,7,3; 8,3,1; 9,3,1 could be 6 different quorums. However, here you can see that server 1 is part of 4 quorums and server 3 is part of 5, so these two servers will need to handle much more traffic.
you could potentially also create quorums like 1,2; 1,3; 1,4; 1,5; 1,6; 1,7; 1,8; 1,9; But this is the same as just having server 1.
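To check candidate quorum sets like the ones above against the two coterie properties, a small checker along these lines can help. It is a sketch, not a standard library routine: `is_coterie` and `load_per_server` are names invented for this example.

```python
from collections import Counter
from itertools import combinations, permutations


def is_coterie(quorums):
    """Check the intersection and minimality properties from the question."""
    sets = [frozenset(q) for q in quorums]
    intersecting = all(g & h for g, h in combinations(sets, 2))
    # Minimality: no quorum may be a superset of another quorum.
    minimal = not any(g >= h for g, h in permutations(sets, 2))
    return intersecting and minimal


def load_per_server(quorums):
    """Count how many quorums each server belongs to."""
    return Counter(server for q in quorums for server in q)


two = [{1, 2, 3, 4, 5}, {5, 6, 7, 8, 9}]
three = [{1, 2, 3, 4}, {4, 5, 6, 7}, {7, 8, 9, 1}]
six = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}, {6, 7, 3}, {8, 3, 1}, {9, 3, 1}]
star = [{1, n} for n in range(2, 10)]

# Counter-examples taken from the quoted definition.
no_common_element = [{1, 2, 3}, {2, 5, 7}, {5, 7, 9}]
superset_pair = [{1, 2, 3}, {1, 3}]

load = load_per_server(six)
```

Calling `is_coterie` on `two`, `three`, `six` and `star` returns `True`, while both counter-examples are rejected. The `load` counter makes the traffic imbalance of the six-quorum layout visible.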

Related

Apriori algorithm - finding associations in production data

I have a problem finding "correct" associations within production data.
The data looks like this
A;B;C;D;E;F;G
1;0;1;0;0;0;0
0;1;0;0;0;0;0
0;0;0;1;0;0;0
0;0;1;0;1;0;0
1;0;0;0;0;0;0
0;0;0;0;0;1;0
0;0;0;0;0;0;1
1;0;1;0;0;0;0
(Of course I have a lot more steps and rows)
Here A, B, C, etc. are production steps. 0 means that a worker did not perform this production step and 1 means that the step was performed. For example, the first row, 1;0;1;0;0;0;0, means that steps A and C were performed at the same time by one worker. The second row, 0;1;0;0;0;0;0, means that (perhaps another) worker performed only production step B.
So it happens that some production steps are usually performed simultaneously by the same worker, just like steps A and C in the example above (2 out of the 3 times that A occurs, it occurs together with C). To find out which steps tend to be performed together, I applied the Apriori algorithm.
I hoped to receive an answer like "If there is a 1 in column A, it is likely that a 1 will appear in column C". Instead, the Apriori algorithm found these "cool" rules, which basically say that there are a lot of 0s in the table. The rules it found were like this: "If there is a 0 in columns A and G, it is likely that there is a 0 in column E". Thanks, Sherlock.
I need the algorithm to focus on rules about where the 1s are in the table, not the 0s. Basically, any rule that looks at 0s can be ignored. I only want rules that look at 1s, because I want to know which production steps tend to be performed together; I don't care which steps are not performed together (0s), because obviously the majority of steps are not performed simultaneously.
Does anybody have some idea how to find associations between 1s instead of 0s?
I use Weka software to do the data mining.
Apriori has no notion of what the labels represent, they are just strings.
Have you tried the -Z option, which treats the first label of each attribute as missing?
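To see what rules over the 1s alone look like, here is a rough sketch in plain Python that ignores the 0s entirely. It is not Weka's Apriori implementation, only a pair count over the eight sample rows from the question; the function name `rules` and the 0.5 confidence threshold are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

HEADER = "A;B;C;D;E;F;G"
ROWS = """1;0;1;0;0;0;0
0;1;0;0;0;0;0
0;0;0;1;0;0;0
0;0;1;0;1;0;0
1;0;0;0;0;0;0
0;0;0;0;0;1;0
0;0;0;0;0;0;1
1;0;1;0;0;0;0"""

steps = HEADER.split(";")
# Keep only the steps flagged with 1; the 0s never enter the item sets.
transactions = [
    {step for step, flag in zip(steps, line.split(";")) if flag == "1"}
    for line in ROWS.splitlines()
]

single = Counter(step for t in transactions for step in t)
pairs = Counter(
    frozenset(p) for t in transactions for p in combinations(sorted(t), 2)
)


def rules(min_confidence=0.5):
    """Return (lhs, rhs, count, confidence) for pair rules above the threshold."""
    found = []
    for pair, together in pairs.items():
        for lhs in pair:
            (rhs,) = pair - {lhs}
            confidence = together / single[lhs]
            if confidence >= min_confidence:
                found.append((lhs, rhs, together, confidence))
    return sorted(found)
```

On the sample rows, the rule A -> C comes out with confidence 2/3, matching the "2 out of 3 times" observation in the question.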

Six degrees of separation interview problem

I was asked an interesting question in an interview lately.
You have 1 million users
Each user has 1 thousand friends
Your system should efficiently answer the question "Do I know him?" for each pair of users. A user "knows" another one if they are connected through 6 levels of friends.
E.g. A is friend of B, B is a friend of C, C is friend of D, D is a friend of E, E is a friend of F. So we can say that, A knows F.
Obviously you can't solve this problem efficiently using BFS or another standard traversal technique. The question is how to store this data structure in a DB and how to perform this search quickly.
What's wrong with BFS?
Execute three steps of BFS from the first node, marking accessible users with flag 1. This takes on the order of 1000^3 = 10^9 steps.
Execute three steps of BFS from the second node, marking accessible users by flag 2. If we meet mark 1 - bingo.
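The two half-searches described above can be sketched as follows. This is a minimal in-memory illustration, assuming the friend lists fit in a dictionary called `friends`; the function names and the small chain graph are made up for the example, and a real system with a million users would need a different storage layer.

```python
def reachable_within(friends, start, depth):
    """Return all users reachable from start in at most `depth` BFS levels."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for user in frontier:
            for friend in friends.get(user, ()):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.add(friend)
        frontier = next_frontier
    return seen


def knows(friends, a, b, half_depth=3):
    """True if a and b are joined by a path of at most 2 * half_depth edges."""
    side_a = reachable_within(friends, a, half_depth)   # "flag 1"
    side_b = reachable_within(friends, b, half_depth)   # "flag 2"
    return not side_a.isdisjoint(side_b)


# Hypothetical chain 0 - 1 - 2 - ... - 7, each user befriending its neighbours.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 7] for i in range(8)}
```

Here `knows(chain, 0, 6)` is `True` because the two searches meet at user 3, while `knows(chain, 0, 7)` is `False` because the users are 7 edges apart.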
What about storing the data as a 1 million x 1 million matrix A, where A[i][j] is the minimum number of steps needed to get from user i to user j? Then you can answer a query almost instantly. Updates, however, are more costly.
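A quick back-of-the-envelope calculation shows what such a matrix costs in storage. The one byte per entry is an assumption made here (enough to hold a small step count); the answer does not specify an encoding.

```python
USERS = 10**6
BYTES_PER_ENTRY = 1  # assumption: one byte is enough for a small step count

entries = USERS * USERS                  # one cell per ordered pair of users
total_bytes = entries * BYTES_PER_ENTRY
terabytes = total_bytes / 10**12         # decimal terabytes
```

Under that assumption the matrix has 10^12 cells and needs about 1 TB before any compression or exploitation of symmetry.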

What is the serializability graph of this?

I am trying to work out a question, but I do not know how to solve it, and I am unfamiliar with most of the terms in it. Here is the question:
Three transactions; T1, T2 and T3 and schedule program s1 are given
below. Please draw the precedence or serializability graph of the s1
and specify the serializability of the schedule S1. If possible, write
at least one serial schedule. r ==> read, w ==> write
T1: r1(X);r1(Z);w1(X);
T2: r2(Z);r2(Y);w2(Z);w2(Y);
T3: r3(X);r3(Y);w3(Y);
S1: r1(X);r2(Z);r1(Z);r3(Y);r3(Y);w1(X);w3(Y);r2(Y);w2(Z);w2(Y);
I do not have any idea how to solve this question, and I need a detailed explanation. Which resource should I look at? Thanks in advance.
There are various ways to test for serializability. The objective of serializability is to find nonserial schedules that allow transactions to execute concurrently without interfering with one another.
First we do a conflict-equivalence test. This will tell us whether the schedule is serializable.
To do this, we must define some rules (i and j are 2 different transactions, R = Read, W = Write).
We cannot swap the order of two actions if they match one of these patterns:
1. Ri(x), Wi(y) - Conflicts (actions of the same transaction)
2. Wi(x), Wj(x) - Conflicts
3. Ri(x), Wj(x) - Conflicts
4. Wi(x), Rj(x) - Conflicts
But these are perfectly valid:
Ri(x), Rj(y) - No conflict (2 reads never conflict)
Ri(x), Wj(y) - No conflict (working on different items)
Wi(x), Rj(y) - No conflict (same as above)
Wi(x), Wj(y) - No conflict (same as above)
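The swap rules listed above fit into one small predicate. This is a sketch only: the `Action` tuple and the name `can_swap` are invented for the illustration.

```python
from typing import NamedTuple


class Action(NamedTuple):
    kind: str  # "r" for read, "w" for write
    txn: int   # transaction number
    item: str  # data item, e.g. "x"


def can_swap(a, b):
    """Return True if two adjacent actions may be reordered."""
    if a.txn == b.txn:
        return False   # rule 1: order inside one transaction is fixed
    if a.item != b.item:
        return True    # different items never conflict
    # Same item, different transactions: only two reads are safe.
    return a.kind == "r" and b.kind == "r"
```

For example, `can_swap(Action("r", 1, "x"), Action("w", 2, "x"))` is `False` (rule 3), while `can_swap(Action("w", 1, "x"), Action("w", 2, "y"))` is `True`.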
So applying the rules above we can derive this (using excel for simplicity):
From the result, we can clearly see that we managed to derive a serial order, i.e. the schedule you have above is equivalent to the serial schedule S(T1, T3, T2).
Now that we know the schedule is serializable and we have a serial schedule, we do the conflict-serializability test:
The simplest way to do this is to use the same rules as in the conflict-equivalence test and look for any pairs of actions that conflict.
r1(x); r2(z); r1(z); r3(y); r3(y); w1(x); w3(y); r2(y); w2(z); w2(y);
----------------------------------------------------------------------
              r1(z)                                     w2(z)
                     r3(y)                                     w2(y)
                                          w3(y)  r2(y)
                                          w3(y)                w2(y)
Using the rules above, we end up with a table like the one above (e.g. we know that reading z in one transaction and then writing z in another transaction causes a conflict; see rule 3).
Given the table, from left to right, we can create a precedence graph with these conditions:
T1 -> T2
T3 -> T2 (only 1 arrow per combination)
Thus we end up with a graph looking like this:
From the graph, since it is acyclic (no cycle), we can conclude that the schedule is conflict-serializable. Furthermore, it is also view-serializable, since every schedule that is conflict-serializable is also view-serializable. We could test view-serializability directly to prove this, but that is rather complicated.
Regarding sources to learn this material, I recommend:
"Database Systems: A Practical Approach to Design, Implementation, and Management (International Edition)" by Thomas Connolly and Carolyn Begg (it is rather expensive, so I suggest looking for a cheaper PDF copy)
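The precedence-graph construction can also be automated. The sketch below takes S1 exactly as written in the question, adds an edge Ti -> Tj whenever an action of Ti conflicts with a later action of Tj, and uses the standard-library `graphlib` module (Python 3.9+) for the cycle check. The function names are invented for this example.

```python
import re
from graphlib import TopologicalSorter

S1 = "r1(X);r2(Z);r1(Z);r3(Y);r3(Y);w1(X);w3(Y);r2(Y);w2(Z);w2(Y);"


def parse(schedule):
    """Turn 'r1(X);w2(Y);' into [('r', 1, 'X'), ('w', 2, 'Y')]."""
    found = re.findall(r"([rw])(\d+)\((\w+)\)", schedule)
    return [(kind, int(txn), item) for kind, txn, item in found]


def precedence_edges(schedule):
    """Collect (earlier_txn, later_txn) for every conflicting pair."""
    ops = parse(schedule)
    edges = set()
    for pos, (kind1, txn1, item1) in enumerate(ops):
        for kind2, txn2, item2 in ops[pos + 1:]:
            if txn1 != txn2 and item1 == item2 and "w" in (kind1, kind2):
                edges.add((txn1, txn2))
    return edges


def serial_order(schedule):
    """Return one equivalent serial order; raises graphlib.CycleError on a cycle."""
    sorter = TopologicalSorter()
    for _, txn, _ in parse(schedule):
        sorter.add(txn)
    for before, after in precedence_edges(schedule):
        sorter.add(after, before)
    return list(sorter.static_order())
```

For S1 the edge set is T1 -> T2 and T3 -> T2, so any order that places T2 last is a valid serial schedule. A schedule whose graph contains a cycle makes `serial_order` raise `graphlib.CycleError`, which signals that it is not conflict-serializable.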
Good luck!
Update
I've developed a little tool which will do all of the above for you (including the graph). It's pretty simple to use, and I've also added some examples.

Determining Number of Outputs for trading type ANN

I am currently trying to implement an ANN that does 1-for-1 trades with 8 different possible goods.
I am wondering how to determine the number of outputs necessary for the ANN to perform adequately.
Should the number of outputs be equal to the number of possible trades? That is, if I have 8 different goods and can trade each one for each of the 8 goods, does the ANN need 8*8 outputs?
To summarize: does an ANN need a number of outputs equal to the number of distinct actions it can perform?
Edit: To clarify, the goods have a worth specific to a situation, which is the input given to the ANN. The 8*8 refers to the number of possible combinations of trading one of the goods for any other.
Thank you in advance.
Classification engine
Neural networks (feed forward) are classification engines - they are not necessarily meant for "storing knowledge" the way decision trees and logical knowledge bases are.
Though it certainly is possible to store predefined decisions inside a neural network - much like a gigantic if-clause.
Number of outputs
If the different outputs stand for different classes, you should use one output signal per class.
If you were to let one output signal imply different classes depending on its value, you would be hinting to the network that an output signal of 10 "is a better class" than one with output -10. Therefore I would strongly recommend using one output signal per class, although this will require more training (with the advantage of possibly fewer plateaus in the search space).
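As a sketch of the one-output-per-class layout, the 8*8 trades from the question can be mapped to 64 output positions. The encoding below (index = give * 8 + get) and all function names are assumptions chosen for the illustration; no particular ANN library is involved.

```python
GOODS = 8


def trade_to_index(give, get):
    """Map a (give, get) pair of goods to one of the 64 output positions."""
    return give * GOODS + get


def index_to_trade(index):
    """Inverse of trade_to_index."""
    return divmod(index, GOODS)


def one_hot(give, get):
    """Training target: 1.0 at the chosen trade, 0.0 everywhere else."""
    target = [0.0] * (GOODS * GOODS)
    target[trade_to_index(give, get)] = 1.0
    return target


def pick_trade(outputs):
    """Interpret network outputs: the strongest signal names the trade."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return index_to_trade(best)
```

With this layout no trade is numerically "larger" than another; the network only has to make the correct output the strongest one.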
I am not sure what you are referring to with:
Meaning if I have 8 different Goods and can trade each one for each of
the 8 Goods does the ANN need 8^8 outputs
Are you going to input a set of "stock values" and force the net to output which stocks to buy and sell?

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be entirely loaded into memory. Let's say it has parts A, B and C, each of which can be loaded into memory, but not all at the same time.
I also have random entries coming in from time to time, and I can barely tell which part an entry might belong to. One approach would be to load A first and check it, then B, then C. But the next entry could belong to B, so I would have to unload C, then load A, then B... I hope this makes sense.
This would clearly be very slow, so I wonder whether there is a better way to do it (if using a DB is not an option)?
I assume that you currently don't use any criterion to decide whether an entry goes into A, B or C; in other words, A, B and C are just the result of dividing the whole data set into 3 equal parts. Am I right? If so, I recommend that you apply a criterion when adding a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you know a priori whether you have to look in A, in B or in C. If your entries are words, the same solution works, but now the criterion is the first letter. Here it may be better to use not 3 sets but 26, the size of the English alphabet. Please note that you still have to keep one of the sets in memory. The advantage is that you do at most 1 load/unload operation per lookup and don't need to check all sets, because you know which one can actually contain your value. This idea is widely used in databases, where it is called partitioning. If your sets store neither numbers nor words but some complex objects, you can still come up with a simple criterion.
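The first-digit scheme can be sketched like this. Everything here is hypothetical scaffolding: `PartitionedSet`, the `load_partition` callable and the in-memory `storage` dictionary stand in for whatever actually reads a part from disk.

```python
def partition_of(entry):
    """Pick the part by the first digit: 0-3 -> A, 4-6 -> B, 7-9 -> C."""
    first = entry[0]
    if first in "0123":
        return "A"
    if first in "456":
        return "B"
    if first in "789":
        return "C"
    raise ValueError(f"entry does not start with a digit: {entry!r}")


class PartitionedSet:
    """Keeps exactly one part in memory and swaps it only when needed."""

    def __init__(self, load_partition):
        self._load = load_partition  # hypothetical callable: part name -> set
        self._name = None
        self._data = set()
        self.loads = 0               # number of load operations performed

    def __contains__(self, entry):
        name = partition_of(entry)
        if name != self._name:       # at most one load per lookup
            self._data = self._load(name)
            self._name = name
            self.loads += 1
        return entry in self._data


# Stand-in for on-disk parts; a real loader would read a file instead.
storage = {"A": {"12", "305"}, "B": {"42"}, "C": {"977"}}
lookup = PartitionedSet(storage.__getitem__)
```

Two consecutive lookups that fall into the same part, such as `"42" in lookup` followed by `"43" in lookup`, cost a single load. A part is swapped only when an entry maps to a different one.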
