To my understanding, strong consistency can be achieved when Vr + Vw > V, where Vr is the read quorum, Vw is the write quorum, and V is the total number of replicas. Assume V = 3.
When writing a value (val = 2) to the DB, the write only needs to succeed on 2 machines (e.g. machines A and B). When reading a value from the DB, it likewise only needs to read from 2 machines, and if they return the same versioned value, strong consistency is achieved.
What if, after successfully persisting val = 2 to machines A and B, A goes down before the value has been replicated to machine C? Then a read has to go to machines B and C, which hold different values. Which value should it choose as the latest result?
It can't pick a value because there is no quorum on the version to pick: B returns the new version and C returns the old one, so neither version is agreed on by Vr = 2 replicas. So effectively the system is unavailable for that read.
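A minimal sketch of that read-side rule, assuming replies carry a version number (the types and names here are illustrative, not from any particular system):

#include <stdio.h>

typedef struct { int version; int value; } Reply;

/* Returns 1 and stores the value only if all Vr replies agree on a version. */
int quorum_read(const Reply *replies, int vr, int *out)
{
    for (int i = 1; i < vr; i++)
        if (replies[i].version != replies[0].version)
            return 0;           /* versions diverge: no quorum, unavailable */
    *out = replies[0].value;
    return 1;
}

int main(void)
{
    /* B still has the new write (version 2), C has the old one (version 1). */
    Reply replies[2] = { { 2, 2 }, { 1, 1 } };
    int v;
    if (!quorum_read(replies, 2, &v))
        printf("read unavailable: no agreement on version\n");
    return 0;
}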
I'm implementing B-trees in the C language. To implement the B-trees I am following certain pseudocode. In following this pseudocode I came across Disk-Read() and Disk-Write() operations that I don't know how to implement.
The idea is to save all the nodes in secondary memory, excluding the root of the B-tree. Every time I have to read a node I perform a Disk-Read() from secondary memory, and every time I want to modify a node's value I perform a Disk-Write() to secondary memory.
Could anyone help me implement these two procedures in C?
Below is the pseudocode for creating an empty B-tree and for the search operation, where these two procedures are called.
B-Tree-Create(T)
    x = Allocate()
    x.leaf = True
    x.n = 0
    Disk-Write(x)
    T.root = x

B-Tree-Search(x, k)
    i = 1
    while (i ≤ x.n) and (k > x.key[i])
        i = i + 1
    if (i ≤ x.n) and (k = x.key[i])
        then return (x, i)
    if x.leaf = True
        then return nil
    Disk-Read(x.c[i])
    return B-Tree-Search(x.c[i], k)
Thanks again
Typically you would use either open() coupled with read()/write() or fopen() coupled with fread()/fwrite(). If you are trying to make more than a toy implementation, you likely want to abstract this part away so that the I/O system is easily replaced. (For example, if building for Windows there may be reason to use CreateFile() along with ReadFile()/WriteFile() instead.) With proper I/O abstraction it would also be possible to have your B-tree backed by a compressed file.
These three sets of functions take different arguments, in different orders, but in the end they all perform the same operations: opening a file and either transferring bytes from secondary storage to memory or transferring bytes from memory to secondary storage.
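As a concrete illustration, here is a minimal sketch using POSIX pread()/pwrite(), assuming each node is serialized into a fixed-size block and stores its own block index (the struct layout and names are assumptions for the example, not part of your pseudocode):

/* Minimal sketch of Disk-Read()/Disk-Write() over a single file, using
   POSIX pread()/pwrite(). Assumes each node is serialized into a
   fixed-size block and records its own block index; a real
   implementation would also serialize keys and child block indices. */
#include <unistd.h>
#include <stdint.h>

#define NODE_SIZE 4096  /* fixed on-disk size of one serialized node */

typedef struct {
    uint32_t n;      /* number of keys currently stored          */
    uint32_t leaf;   /* 1 if this node is a leaf                 */
    uint64_t block;  /* this node's block index within the file  */
    /* keys and child block indices would follow here */
} BTreeNode;

/* Read the node stored at block `block` into *x; returns 0 on success. */
int disk_read(int fd, uint64_t block, BTreeNode *x)
{
    ssize_t got = pread(fd, x, sizeof *x, (off_t)(block * NODE_SIZE));
    return got == (ssize_t)sizeof *x ? 0 : -1;
}

/* Write *x back to its own block; returns 0 on success. */
int disk_write(int fd, const BTreeNode *x)
{
    ssize_t put = pwrite(fd, x, sizeof *x, (off_t)(x->block * NODE_SIZE));
    return put == (ssize_t)sizeof *x ? 0 : -1;
}

The file would be opened once with open(path, O_RDWR | O_CREAT, 0644), and this pair of functions can sit behind whatever abstraction you choose, so swapping in CreateFile()/ReadFile()/WriteFile() on Windows only touches this layer.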
I want to understand the following scenario:
3-node cluster (n1, n2, n3)
Assume n1 is the leader
The client sends an operation/request to n1
For some reason during the operation (i.e. appendEntry, commitEntry, ...) leadership changes to n2
But n1 has successfully written the entry to its log
Should this be considered a failure of the operation/request, or should it return "Leader Change"?
This is an indeterminate result, because at that moment we don't know whether the value will be committed or not. That is, the new leader may decide to keep or overwrite the value; we just do not have the information to know.
In some systems that I have maintained, the lower-level system (the consensus layer) gives two success results for each request. It first returns Accepted when the leader puts the request into its log, and then Committed when the value is sufficiently replicated.
The Accepted result means that the value may be committed later, but it may not be.
The layers that wrap the consensus layer can be a bit smarter about the return value to the client. They have a broader view of the system and may retry the requests and/or query the new leader about the value. That way the system-as-a-whole can present the customary single return value.
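For illustration, the two-stage result might look like this (the names are hypothetical, not taken from any particular consensus library):

/* Hypothetical result type for the two-stage protocol described above. */
typedef enum {
    ACCEPTED,       /* the leader appended the entry to its own log       */
    COMMITTED,      /* the entry is replicated to a quorum and is durable */
    LEADER_CHANGED  /* indeterminate: the new leader may keep or drop it  */
} ConsensusResult;

A wrapping layer can collapse LEADER_CHANGED into a single client-visible result by retrying with a unique request id or by querying the new leader for the entry.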
Problem & context
This post describes a lock-free 32-bit hash-table algorithm. The core of the algorithm is a lock-free linear search, which is used to insert a key-val pair in a (logical) list:
Here is the code provided:
void ArrayOfItems::SetItem(uint32_t key, uint32_t value)
{
    for (uint32_t idx = 0;; idx++)
    {
        // Try to claim this slot: CAS the key from 0 (empty) to our key.
        // The CAS returns the previous key, so an existing entry with the
        // same key is detected in the same step.
        uint32_t prevKey = mint_compare_exchange_strong_32_relaxed(&m_entries[idx].key, 0, key);
        if ((prevKey == 0) || (prevKey == key))
        {
            // Slot was empty or already held our key: publish the value.
            mint_store_32_relaxed(&m_entries[idx].value, value);
            return;
        }
        // Slot is taken by another key; keep scanning linearly.
    }
}
For a specific problem, I need to insert random key-val pairs into the table. As such, I need at least 64-bit keys: with 32-bit keys there is a ~50% chance of a collision after 65536 insertions (the birthday bound), which is far too few for my use case. Unfortunately, I do not have a 64-bit cmpxchg primitive.
Question
Is it possible to generalize the hash-table above to 64-bit keys, using only 32-bit cmpxchg?
I'm not sure from the question whether you still want to retain the lock-free characteristic, or just want to get 64-bit key/value storage up and running.
There is a 64-bit MurmurHash3 posted here by @kol:
hashing a small number to a random looking 64 bit integer
Clearly, if you introduced a second array to assert key-location ownership, and respected that for value storage, you could then read and CAS 64-bit values in two steps, then release ownership. That doesn't get you lock-free, of course.
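A minimal sketch of that ownership idea, using C11 atomics instead of the mintomic calls from the post (the struct, states, and names are assumptions for illustration; as noted, this is lock-avoiding, not lock-free):

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint state;         /* 0 = empty, 1 = claimed, 2 = published */
    uint32_t key_lo, key_hi;   /* the 64-bit key, stored as two halves  */
    uint32_t value;
} Slot;

/* Claim the slot with a 32-bit CAS, write both key halves, then publish. */
int slot_set_key(Slot *s, uint64_t key)
{
    unsigned expected = 0;
    if (!atomic_compare_exchange_strong(&s->state, &expected, 1))
        return 0;                  /* another thread owns this slot */
    s->key_lo = (uint32_t)key;
    s->key_hi = (uint32_t)(key >> 32);
    atomic_store(&s->state, 2);    /* readers trust the halves only now */
    return 1;
}

A reader must check state == 2 before trusting the two halves; a thread that finds the slot claimed but not yet published has to spin or move on, which is exactly why this is not lock-free.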
--------------- Edit: ---------------
The author has at least two videos on his hash table, both from 2007:
Advanced Topics in Programming Languages: A Lock-Free Hash Table
https://www.youtube.com/watch?v=HJ-719EGIts
A Fast Wait-Free Hash Table
https://youtu.be/WYXgtXWejRM
He relates his program flow to a finite-state machine. Disregarding the issue of growing the table, there are four states a location can be in before a potential mutation is applied to that location. Key/Value = [nil/nil], [X/nil], [X/X], [nil/X].
Reading the state in preparation for mutation does not guarantee, under concurrency, that the state remains unchanged at the time the mutation is applied.
With 32-bit operations, we have the following logic:
- If the read key = the desired key, the value can be written to the location.
- If the read key = nil, and the value = non-nil, then another thread is mutating the location.
- If the read key = nil, and the value = nil, then the location can be written via a successful key CAS.
If you want to use 32-bit atomic operations to store 64-bit data without locking, then the state diagram grows, with more failure states, e.g.:
- You could read a half-created Key.
- A CAS update of one half of a Value entry may be stomped by another thread, failing on the second CAS.
- A CAS creation of one half of a Key entry may be stomped by another thread, failing on the second CAS.
- The 32-bit representation of the array initialiser 'nil' should be excluded from occurring as either half of a 64-bit key or value (see the sketch below).
The process of increasing the table size adds some more states to consider, also.
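For the last bullet, one illustrative way to keep nil (here assumed to be 0) out of both halves is to remap offending keys before insertion; this trades a tiny amount of extra collision probability for an unambiguous empty state (purely an assumption for this sketch):

#include <stdint.h>

/* Nudge any all-zero 32-bit half to 1 so that 0 can only ever mean
   "empty slot". A few distinct keys now map to the same stored key,
   which the table must tolerate like any other hash collision. */
static inline uint64_t make_storable(uint64_t key)
{
    if ((uint32_t)key == 0)         key |= 1;
    if ((uint32_t)(key >> 32) == 0) key |= (uint64_t)1 << 32;
    return key;
}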
I want one primary collection of items of a single type, to which modifications are made over time. Periodically, several slave collections will synchronize with the primary collection. The primary collection should send a delta of items to the slave collections.
Primary Collection: A, C, D
Slave Collection 1: A, C (add D)
Slave Collection 2: A, B (add C, D; remove B)
The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.
I don't want to push more data than necessary since the collection may become quite large.
What kind of data structures and strategies would be ideal for this?
For that I use differential execution.
(BTW, the word "slave" is uncomfortable for some people, with reason.)
For each remote site, there is a sequential file at the primary site representing what exists on the remote site.
There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist.
Those differences produce deltas, which are transmitted to the remote site.
At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.
The advantage of this is that it does not depend on detecting change events in the primary collection; such events are often unreliable, self-cancelling, or made irrelevant by later changes, so you cut way down on needless transmissions to the remote site.
In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta.
Here are a couple such algorithms:
If the collections can be sorted (like your A,B,C example), just run a merge loop:
while (ix < nx && iy < ny) {
    if (X[ix] < Y[iy]) {
        // X[ix] was inserted in X
        ix++;
    } else if (Y[iy] < X[ix]) {
        // Y[iy] was deleted from X
        iy++;
    } else {
        // the two elements are equal; skip them both
        ix++; iy++;
    }
}
while (ix < nx) {
    // X[ix] was inserted in X
    ix++;
}
while (iy < ny) {
    // Y[iy] was deleted from X
    iy++;
}
If the collections cannot be sorted (note relationship to Levenshtein distance),
Until we have read through both collections X and Y,
See if the current items are equal
else see if a single item was inserted in X
else see if a single item was deleted from X
else see if 2 items were inserted in X
else see if a single item was replaced in X
else see if 2 items were deleted from X
else see if 3 items were inserted in X
else see if 2 items in X replaced 1 item in Y
else see if 1 item in X replaced 2 items in Y
else see if 3 items were deleted from X
etc., etc., up to some limit (sketched just below)
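Here is a compact sketch of that probing loop for arrays of ints; it tries skip combinations of growing total size, mirroring the order above (the function name, delta format, and limit are assumptions for illustration, not my actual code):

#include <stdio.h>

#define LIMIT 6  /* largest combined skip to probe before giving up */

/* X is the primary (current) list, Y is the copy of the remote list. */
void diff_unsorted(const int *X, int nx, const int *Y, int ny)
{
    int ix = 0, iy = 0;
    while (ix < nx && iy < ny) {
        if (X[ix] == Y[iy]) { ix++; iy++; continue; }
        int matched = 0;
        /* Probe skips of growing total size; within one size, prefer
           insertions (dx) over deletions (dy), as in the list above. */
        for (int total = 1; total <= LIMIT && !matched; total++) {
            for (int dx = total; dx >= 0; dx--) {
                int dy = total - dx;
                if (ix + dx < nx && iy + dy < ny &&
                    X[ix + dx] == Y[iy + dy]) {
                    for (int k = 0; k < dx; k++)
                        printf("insert %d\n", X[ix + k]);
                    for (int k = 0; k < dy; k++)
                        printf("delete %d\n", Y[iy + k]);
                    ix += dx; iy += dy;
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched) {  /* nothing aligned: emit a single replacement */
            printf("insert %d\ndelete %d\n", X[ix], Y[iy]);
            ix++; iy++;
        }
    }
    while (ix < nx) printf("insert %d\n", X[ix++]);  /* tail inserted */
    while (iy < ny) printf("delete %d\n", Y[iy++]);  /* tail deleted  */
}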
Performance is generally not an issue, because the procedure does not have to be run at high frequency.
There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.
If one doesn't push all the data, some sort of log is required, which trades pipe bandwidth for main memory. The knob for finding a good balance between CPU and memory usage would be the 'push' frequency.
From your question, I assume you have more than one slave process. In this case, a shared-memory or CMA (Linux) approach with double buffering in the master process should outperform multiple pipes by far, since it doesn't even require the multithreaded pushing you would otherwise use to optimize overall pipe throughput during synchronization.
The slave processes could be notified using a global synchronization barrier, reading from masterCollectionA without copying while the master modifies masterCollectionB (which is initialized as a copy of masterCollectionA), and vice versa. Access to a collection should be interlocked between slaves and master. A slave could copy the collection (a snapshot) if it would otherwise block it past the master's next update attempt, thus allowing the master to continue. Modifications in slave processes could be implemented with a copy-on-write strategy for single elements. This cooperative approach is rather simple to implement, and as long as the slave processes don't copy whole snapshots every time, the overall memory consumption stays low. A sketch of the buffer-swapping part follows.
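A minimal sketch of the double-buffering handshake, with threads standing in for the slave processes (a cross-process version would place this structure in shared memory and create the barrier with PTHREAD_PROCESS_SHARED; all names and sizes here are illustrative):

#include <pthread.h>

#define N 1024

typedef struct {
    int buf[2][N];              /* masterCollectionA / masterCollectionB */
    int readable;               /* which side slaves may read this round */
    pthread_barrier_t barrier;  /* master + all slaves meet here         */
} Shared;

/* Master: fill the hidden side, then swap it in across two barriers. */
void master_round(Shared *s)
{
    int w = 1 - s->readable;
    for (int i = 0; i < N; i++)
        s->buf[w][i] = i;               /* stand-in for real updates */
    pthread_barrier_wait(&s->barrier);  /* wait: slaves finished reading */
    s->readable = w;                    /* publish the freshly written side */
    pthread_barrier_wait(&s->barrier);  /* release: slaves may read again */
}

/* Slave: read the visible side between the same two barrier crossings. */
void slave_round(Shared *s, int *out)
{
    for (int i = 0; i < N; i++)
        out[i] = s->buf[s->readable][i];
    pthread_barrier_wait(&s->barrier);
    pthread_barrier_wait(&s->barrier);
}

The barrier would be initialized once with pthread_barrier_init(&s->barrier, NULL, number_of_slaves + 1).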
I've been using Redis for a while as a backend for Resque and now that I'm looking for a fast way to perform intersect operation on large sets of data, I decided to give Redis a shot.
I've been conducting the following test:
— x, y and z are Redis sets, they all contain approx. 1 million members (random integers taken from a seed array containing 3M+ members).
— I want to intersect x, y and z, so I'm using SINTERSTORE (to avoid the overhead of retrieving the result from the server to the client)
sinterstore r x y z
— the resulting set (r) contains about half a million members, Redis computes this set in approximately half a second.
Half a second is not bad, but I would need to perform such calculations on sets that could contain more than a billion members each.
I haven't tested how Redis would react with such enormous sets but I assume it would take a lot more time to process the data.
Am I doing this right? Is there a faster way to do that?
Notes:
— native arrays aren't an option since I'm looking for a distributed data store that would be accessed by several workers.
— I get these results on an 8-core 3.4GHz Mac with 16GB of RAM; disk persistence has been disabled in the Redis configuration.
I suspect that bitmaps are your best hope.
In my experience, redis is a perfect server for bitmaps; you would use the string data structure (one of the five data structures available in redis).
Many, or perhaps all, of the operations you will need to perform are available out of the box in redis, as atomic operations.
The redis setbit operation has time complexity of O(1).
In a typical implementation, you would hash your array values to offset values on the bit string, then set each bit at its corresponding offset (or index); like so:
>>> r1.setbit('k1', 20, 1)
The first argument is the key, the second is the offset (index value), and the third is the value at that index on the bitmap.
To find out whether a bit is set at this offset (20), call getbit, passing in the key for the bit string:
>>> r1.getbit('k1', 20)
Then, on those bitmaps, you can of course perform the usual bitwise operations, e.g. logical AND, OR, and XOR.
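For instance, with the same client, an intersection across several bitmaps can be computed server-side with BITOP and the size of the result read back with BITCOUNT (the key names here are just placeholders):

>>> r1.bitop('AND', 'result', 'k1', 'k2', 'k3')
>>> r1.bitcount('result')

Both commands run inside the server, so, like SINTERSTORE, this avoids shipping the full sets to the client.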