Efficient data structure and strategy for synchronizing several item collections - c

I want one primary collection of items of a single type, to which modifications are made over time. Periodically, several slave collections will synchronize with the primary collection. The primary collection should send a delta of items to each slave collection.
Primary Collection: A, C, D
Slave Collection 1: A, C (add D)
Slave Collection 2: A, B (add C, D; remove B)
The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.
I don't want to push more data than necessary since the collection may become quite large.
What kind of data structures and strategies would be ideal for this?

For that I use differential execution.
(BTW, the word "slave" is uncomfortable for some people, with reason.)
For each remote site, there is a sequential file at the primary site representing what exists on the remote site.
There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist.
Those differences produce deltas, which are transmitted to the remote site.
At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.
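As a concrete illustration, here is a minimal Python sketch of one such pass for one remote site. The one-item-per-line state file, the push() callback, and the sorted-list assumption are mine, not part of the original design; the delta detection inside is just the merge loop shown further below.

def sync_pass(primary_items, state_path, push):
    # The state file holds one item per line (sorted) and represents what the
    # remote site currently has; push() writes one delta into the pipe.
    current = sorted(primary_items)
    try:
        with open(state_path) as f:
            remote = [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        remote = []                       # first sync: the remote has nothing yet
    ix = iy = 0
    while ix < len(current) or iy < len(remote):
        if iy >= len(remote) or (ix < len(current) and current[ix] < remote[iy]):
            push("add", current[ix]); ix += 1       # exists here, not there
        elif ix >= len(current) or remote[iy] < current[ix]:
            push("remove", remote[iy]); iy += 1     # exists there, not here
        else:
            ix += 1; iy += 1                        # same item on both sides
    with open(state_path, "w") as f:      # record what the remote will hold now
        f.writelines(item + "\n" for item in current)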
The advantage of this approach is that it does not depend on detecting change events in the primary collection. Such change events are often unreliable, self-cancelling, or made irrelevant by later changes, so you cut way down on needless transmissions to the remote site.
In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta.
Here are a couple such algorithms:
If the collections can be sorted (like your A,B,C example), just run a merge loop:
// X = current primary collection (sorted), Y = saved copy of the remote
// collection (sorted); nx and ny are their lengths.
while (ix < nx && iy < ny) {
    if (X[ix] < Y[iy]) {
        // X[ix] was inserted in X
        ix++;
    } else if (Y[iy] < X[ix]) {
        // Y[iy] was deleted from X
        iy++;
    } else {
        // the two elements are equal; skip them both
        ix++; iy++;
    }
}
while (ix < nx) {
    // X[ix] was inserted in X
    ix++;
}
while (iy < ny) {
    // Y[iy] was deleted from X
    iy++;
}
If the collections cannot be sorted (note relationship to Levenshtein distance),
Until we have read through both collections X and Y,
See if the current items are equal
else see if a single item was inserted in X
else see if a single item was deleted from X
else see if 2 items were inserted in X
else see if a single item was replaced in X
else see if 2 items were deleted from X
else see if 3 items were inserted in X
else see if 2 items in X replaced 1 item in Y
else see if 1 item in X replaced 2 items in Y
else see if 3 items were deleted from X
etc. etc. up to some limit
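For concreteness, here is a minimal Python sketch of that bounded search (my own illustration, not the exact procedure above): x is the current primary collection, y is the saved copy of the remote collection, and the deltas say what to add to and remove from the remote site.

def diff(x, y, max_skip=3):
    # Bounded-lookahead diff sketch for unsortable collections.
    deltas = []
    ix = iy = 0
    while ix < len(x) and iy < len(y):
        if x[ix] == y[iy]:              # the current items are equal
            ix += 1
            iy += 1
            continue
        # Try, in increasing total cost, "dx items inserted in X and
        # dy items deleted from X", up to the max_skip limit.
        best = None
        for total in range(1, 2 * max_skip + 1):
            for dx in range(min(total, max_skip), -1, -1):
                dy = total - dx
                if dy > max_skip:
                    continue
                if ix + dx < len(x) and iy + dy < len(y) and x[ix + dx] == y[iy + dy]:
                    best = (dx, dy)
                    break
            if best:
                break
        if best is None:
            best = (1, 1)               # give up: treat it as a single replacement
        dx, dy = best
        deltas += [("add", item) for item in x[ix:ix + dx]]
        deltas += [("remove", item) for item in y[iy:iy + dy]]
        ix += dx
        iy += dy
    deltas += [("add", item) for item in x[ix:]]      # leftover insertions
    deltas += [("remove", item) for item in y[iy:]]   # leftover deletions
    return deltas

The order of the inner loop decides whether "2 inserted" is tried before "1 replaced"; max_skip plays the role of the "up to some limit" above.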
Performance is generally not an issue, because the procedure does not have to be run at high frequency.
There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.

If you don't push all the data, some kind of log is required, which uses main memory instead of pipe bandwidth. The knob for finding a good balance between CPU and memory usage would be the 'push' frequency.
From your question I assume you have more than one slave process. In that case, a shared-memory or CMA (Linux) approach with double buffering in the master process should outperform multiple pipes by far, as it doesn't even require multithreaded pushing, which you would otherwise use to optimize the overall pipe throughput during synchronization.
The slave processes could be notified through a global synchronization barrier to read from masterCollectionA without copying, while the master modifies masterCollectionB (which is initialized with a copy of masterCollectionA), and vice versa. Access to a collection should be interlocked between slaves and master. A slave could copy that collection (take a snapshot) if it would otherwise block it past the master's next update attempt, thus allowing the master to continue. Modifications in the slave processes could be implemented with a copy-on-write strategy for single elements. This cooperative approach is rather simple to implement, and as long as the slave processes don't copy whole snapshots every time, the overall memory consumption stays low.
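Below is a toy sketch of that double-buffering idea using Python multiprocessing primitives instead of raw shared memory or CMA; the buffer layout, timings, and names (master, slave, SIZE) are illustrative assumptions, not a drop-in implementation.

import multiprocessing as mp
import time

SIZE = 8  # toy "collection": 8 integers

def master(bufs, locks, active, stop):
    version = 0
    while not stop.is_set():
        spare = 1 - active.value
        with locks[spare]:                          # wait until no slave reads it
            bufs[spare][:] = bufs[active.value][:]  # start from the last snapshot
            version += 1
            bufs[spare][0] = version                # "modify" the collection
        active.value = spare                        # publish the new snapshot
        time.sleep(0.1)

def slave(name, bufs, locks, active, stop):
    while not stop.is_set():
        idx = active.value
        with locks[idx]:
            snapshot = list(bufs[idx])              # copy out so master can flip again
        print(name, "sees version", snapshot[0])
        time.sleep(0.3)

if __name__ == "__main__":
    bufs = [mp.Array("i", SIZE) for _ in range(2)]
    locks = [mp.Lock(), mp.Lock()]
    active = mp.Value("i", 0)
    stop = mp.Event()
    procs = [mp.Process(target=master, args=(bufs, locks, active, stop))]
    procs += [mp.Process(target=slave, args=("slave%d" % i, bufs, locks, active, stop))
              for i in range(2)]
    for p in procs:
        p.start()
    time.sleep(2)
    stop.set()
    for p in procs:
        p.join()

A slave may occasionally pick up the buffer index just before a flip and read the older snapshot, which is harmless here; the barrier and per-element copy-on-write described above are left out for brevity.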

Related

Solution to handle billions of records for faster insertion and instant retrieval

I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of some other file (call them parents), as shown below. The paths of the parent files are unique.
%: cat input.txt - grand parent file
/root/a/b/c/1.txt -- parent file1
/root/a/b/c/2.txt -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
up to 1 million files.
Again, each of the above parent files contains absolute paths of different files (call them children) and their line numbers, as shown below. The same child file may be present in multiple parent files with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
up to thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
up to thousands of files
Now my requirement is that, given a child file and line number, I need to return all the parent files that contain that child file and line number, within a minute. The insertion needs to be completed within a day.
I created a relational database with following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
For a given file name and line number:
SELECT ParentFileName from ParentChildMapping where ChildFileName="s1.c" and LNumber=1;
I divided the grandparent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the content of each parent file, and inserts it into the database. I can create a thousand processes running in parallel and insert all the records in parallel, but I am not sure what the impact on the relational database will be, since I will be inserting millions of records in parallel. Also, I am not sure if a relational database is the right approach here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite, but it did not support concurrent inserts and failed with a database-lock error. Now I want to try MySQL or any other solution that suits the situation.
Sample Code that runs as thousand processes in parallel to insert into MySQL:
import MySQLdb

connection = MySQLdb.connect(host, username, ...)
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        each_parent_file = each_parent_file.strip()
        with open(each_parent_file) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(each_parent_file, child_file_name, line_num)
                                for line_num in line_numbers]
                cursor.executemany(
                    "INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber) "
                    "VALUES (%s, %s, %s)", insert_items)
connection.commit()
cursor.close()
connection.close()
Let's start with a naïve idea of what a database would need to do to organize your data.
You have a million parent files.
Each one contains thousands of child files. Let's say 10,000.
Each one contains a list of line numbers. You didn't say how many. Let's say 100.
This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose each is 50 bytes. That is 50 terabytes of data. We need it organized somehow, so we sort it. This requires on the order of log_2(10^12) passes, which is around 40. The naïve approach therefore moves about 2 * 10^15 bytes of data. If we do this in a day of 86,400 seconds, we need to process about 23 GB of data per second.
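A quick sanity check of those numbers (assumed sizes, just restating the arithmetic above):

records   = 10**6 * 10**4 * 10**2       # parents * children per parent * lines per child
raw_bytes = records * 50                # ~50 terabytes of 50-byte records
passes    = 40                          # ~log2(10**12) sort passes
total_io  = raw_bytes * passes          # ~2e15 bytes touched while sorting
print(total_io / 86400 / 10**9)         # ~23 GB to process per second for a day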
Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.
Can we improve this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12 byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" is NOT going to give you the desired performance on a standard relational database approach.
Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.
So...what can you do?
First, what can you do with out-of-the-box tools?
If one computer doesn't have the horsepower, we need something distributed. We need a distributed key/value store like Cassandra. We'll need something like Hadoop or Spark to process the data.
If we have those, all we need to do is process the files and load them into Cassandra as records of line numbers, keyed by parent+child file. We then do a map-reduce to find, for each child+line number, which parent files contain it, and store that back into Cassandra. We then get answers by querying Cassandra.
BUT keep in mind the back-of-the-envelope estimate of the amount of data and processing required. This approach allows us, with some overhead, to do all of that in a distributed way, which lets us do that much work and store that much data in a fixed amount of time. However, you will also need that many machines to do it on. You can easily rent them from AWS, but you'll wind up paying for them as well.
OK, suppose you're willing to build a custom solution, can you do something more efficient? And maybe run it on one machine? After all your original data set fits on one machine, right?
Yes, but it will also take some development.
First, let's make the data more efficient. An obvious step is to create lookup tables from file names to indexes. You already have the parent files in a list; this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition), then use Unix commands to sort -u it down to the canonical ones, and build a similar child-file lookup from that.
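A minimal sketch of that interning step, using plain in-memory dicts; at this scale the same role would be played by RocksDB or another persistent key/value store (the names here are mine):

parent_id = {}    # path -> integer ID (forward lookup)
parent_path = []  # integer ID -> path (reverse lookup)

def intern_parent(path):
    # Assign the next free integer to a path the first time we see it.
    if path not in parent_id:
        parent_id[path] = len(parent_path)
        parent_path.append(path)
    return parent_id[path]

intern_parent("/root/a/b/c/1.txt")   # -> 0
intern_parent("/root/a/b/c/2.txt")   # -> 1
intern_parent("/root/a/b/c/1.txt")   # -> 0 again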
Next, the reason why we were generating so much data before is that we were taking a line like:
s1.c,1,2,3,4,5
and were turning it into:
s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt
But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can turn this into something like this:
42,1,1,5
Meaning that child file 42, parent file 1 starts on line 1 and ends on line 5. If we use, say, 4 bytes for each field then this is a 16 byte block. And we generate just a few per line. Let's say an average of 2. (A lot of lines will have one, others may have multiple such blocks.) So our whole data is 20 billion 16 byte rows for 320 GB of data. Sorting this takes 34 passes, most of which don't need to be written to disk, which can easily be inside of a day on a single computer. (What you do is sort 1.6 GB blocks in memory, then write them back to disk. Then you can get the final result in 8 merge passes.)
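As a sketch, the 16-byte record could be packed like this (the field order and 4-byte widths are assumptions consistent with the estimate above):

import struct

RECORD = struct.Struct("<IIII")   # child_id, parent_id, first_line, last_line

def encode(child_id, parent_id, first_line, last_line):
    return RECORD.pack(child_id, parent_id, first_line, last_line)

# "s1.c,1,2,3,4,5" in parent /root/a/b/c/1.txt, with s1.c interned as 42
# and the parent interned as 1, becomes one 16-byte record:
encode(42, 1, 1, 5)               # 16 bytes instead of five ~50-byte rows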
And once you have that sorted file, you can NOW just write out the offsets at which each child file's records start.
If each child file is in thousands of parent files, then decoding this is a question of doing a lookup from filename to child file ID, then a lookup from child file ID to the range of records that list that child file. Go through those thousands of records and form the list of parent files whose line-number range contains the given line. Then look up their names and return the result. This lookup should run in seconds, and (since everything is read-only) can be done in parallel with other lookups.
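Here is a sketch of that read path over the sorted record file; a binary search over an mmap of the 16-byte records stands in for the precomputed offset table (names and layout are mine):

import struct

RECORD = struct.Struct("<IIII")   # child_id, parent_id, first_line, last_line

def parents_containing(buf, child_id, line_no):
    # buf is bytes or an mmap of the records, sorted by (child_id, parent_id).
    n = len(buf) // RECORD.size
    lo, hi = 0, n
    while lo < hi:                # binary-search the first record for child_id
        mid = (lo + hi) // 2
        if RECORD.unpack_from(buf, mid * RECORD.size)[0] < child_id:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in range(lo, n):        # scan that child's contiguous block
        child, parent, first, last = RECORD.unpack_from(buf, i * RECORD.size)
        if child != child_id:
            break
        if first <= line_no <= last:
            hits.append(parent)
    return hits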
BUT this is a substantial amount of software to write. It is how I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost effective.

Starvation of one of 2 streams in ConnectedStreams

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value);
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation. Even though there are a elements ready to be processed, they are not processed by processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000);
    }
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting parallelism of 10 for the whole pipeline - didn't work
* Setting parallelism of 15 for the whole pipeline - did work
The question
We don't want to use all these resources, since stream B is activated very rarely and high parallelism is overkill for the stream A elements.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.

Determine percentage of unused keys in large redis DB

I have a Redis database with many millions of keys in it. Over time, the keys that I have written to and read from have changed, and so there are many keys that I am simply not using any more. Most don't have any kind of TTL either.
I want to get a sense for what percentage of the keys in the Redis database is not in use any more. I was thinking I could use hyperloglog to estimate the cardinality of the number of keys that are being written to, but it seems like a lot of work to do a PFADD for every key that gets written to and read from.
To be clear, I don't want to delete anything yet, I just want to do some analysis on the number of used keys in the database.
I'd start with the scan command to iterate through the keys, and use the object idletime command on each to collect the number of seconds since the key was last used. From there you can generate metrics however you like.
One way, using Redis, would be to use a sorted set with the idletime of the key as its score. The advantage of this over HLL is that you can then say "give me keys idle between x and y seconds ago" by using zrange and/or zrevrange. The results of that you could then use for operations such as deletion, archival, or setting a TTL. With HLL you can't do this.
Another advantage is that, unless you store the result in Redis, there is only a Redis cost when you run it. You don't have to modify your code to do additional operations when accessing keys, for example.
The accuracy of the object idle time is around ten seconds or so, if I recall. But for getting an idea of how many and which keys haven't been accessed in a given time frame, it should work fine.
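For example, a rough sketch with the redis-py client (the 30-day threshold and SCAN batch size are arbitrary choices of mine):

import redis

r = redis.Redis()
idle = []
for key in r.scan_iter(count=1000):            # SCAN, so it won't block the server
    idle.append(r.object("idletime", key))     # seconds since last access

stale = sum(1 for s in idle if s > 30 * 86400) # call "unused" = idle for over 30 days
print("%.1f%% of %d keys idle > 30 days" % (100.0 * stale / max(len(idle), 1), len(idle)))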
You can analyze the data in time windows, and use a HyperLogLog to estimate the key cardinality for each time window.
For example, you can use a hyperloglog for each day's analysis:
// for each key that has been read or written in day1
// add it to the corresponding hyperloglog
pfadd key-count-day1 a b
pfadd key-count-day1 c d e
// for each key that has been read or written in day2
// add it to the corresponding hyperloglog
pfadd key-count-day2 a
pfadd key-count-day2 c
In this case, you can get the estimated number of keys that are active in dayN with the hyperloglog whose key is key-count-dayN.
With pfcount, you can get the number of active keys for each day or several days.
// number of active keys in day2: count2
pfcount key-count-day2
// number of active keys in day1 and day2: count-total
pfcount key-count-day1 key-count-day2
With these 2 counts, you can calculate the percentage of keys that are unused since day2: (count-total - count2) / count-total
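The same calculation with the redis-py client (key names as in the example above):

import redis

r = redis.Redis()
count2 = r.pfcount("key-count-day2")                         # active in day2
count_total = r.pfcount("key-count-day1", "key-count-day2")  # active in day1 or day2
print("unused since day2:", (count_total - count2) / count_total)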

graph database physical distribution and indexing

My question is not on the query language but on the physical distribution of data in a graph database.
Let's assume a simple user/friendship model. In an RDBMS you would create a table storing IDUserA/IDUserB pairs to represent friendships.
If we assume a bunch of IT girls, for example, with the Facebook limit of 5k friends, we quickly get to huge amounts of data. If GirlA (ID 1) simply likes GirlB (ID 2), there would be an entry [1][2] in the table.
With this model it is not possible to get around data redundancy in friendships, because we either have to do two queries (is there an entry with ID = 1 in IDUserA or in IDUserB, which means physically searching both columns), or store both [1][2] and [2][1], which ends up in data redundancy. For a heavy user this means checks against 5000/10000 entries in an indexed column, which is astronomically big.
So OK, use graph DBs. We assume the girls are nodes. GirlA is the first one ever entered into the DB, so her ID is simply 0. The entry contains an isUsed flag in a one-byte chunk, which is 1 if the node is in use. The next 4 bytes identify the file her node is stored in (which allows nearly 4.3 billion possible files), and if we assume a file size of 16.7 MB we can use 3 more bytes to declare the offset inside that file.
Let's assume we define the username datatype as a chunk of 256 bytes (and be, for this example, that rigid).
For GirlA it is [1]0.0.0.0-0.0.0
= her user ID 0 times 256 = offset 0
For GirlB it is [1]0.0.0.0-0.1.0
= her user ID 1 times 256 = offset 256,
so her username data starts in file 0_0_0_0.dat at offset 256 from the start. We don't have to search for her data; we can simply calculate where it is. User 100 would be stored in the same file at offset 25600, and so forth. User 65536 would be stored in file 0_0_0_1.dat at offset 0. Loaded into RAM this is only a pointer and pretty fast.
So with this method we could store more nodes than humans have ever lived.
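The arithmetic of that addressing scheme as a small sketch (the 256-byte record size is from the question; 2^24-byte files, i.e. 16.7 MB, are my rounding assumption):

RECORD_SIZE = 256
RECORDS_PER_FILE = 2**24 // RECORD_SIZE   # 65536 records per 16.7 MB file

def locate(user_id):
    # A node's location is pure arithmetic: no search, no index.
    file_no = user_id // RECORDS_PER_FILE
    offset = (user_id % RECORDS_PER_FILE) * RECORD_SIZE
    return file_no, offset

locate(0)       # (0, 0)      GirlA
locate(100)     # (0, 25600)
locate(65536)   # (1, 0)      first record of the next file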
BUT: how to find relationships? OK, with edges. But how to store them? All in one "column" is stupid, because then we are back to relational models. In a hash table? OK, we could store 0_0_0_0.frds as a hash table containing all friends of User0, kick off a new instance of a User class object, add the friends to a binary list or tree reachable through the pointer cUser.pFriendlist, and we would be done. But I think I am making a mistake somewhere.
Shouldn't graph databases be something more than mathematical nodes connected with hash tables full of edges?
The use of nodes and edges is clear, because it allows connecting anything to anything with relationships. But what about the queries and their speed?
Keeping different edge types in different kinds of files seems somehow wrong, even if access is really fast on SSDs.
Sure, I could use a simple relational table to store an edgetype/data-ending pair, but please help me: where am I going wrong?

C - How can this simple transaction() function be completely free of deadlocks?

So I have this basic transaction() function written in C:
void transaction (Account from, Account to, double amount) {
    mutex lock1, lock2;
    lock1 = get_lock(from);
    lock2 = get_lock(to);
    acquire(lock1);
    acquire(lock2);
    withdraw(from, amount);
    deposit(to, amount);
    release(lock2);
    release(lock1);
}
It's my understanding that the function is mostly deadlock-free, since it locks one account and then the other (instead of locking one, making changes, and then locking another). However, if this function were called simultaneously by these two calls:
transaction (savings_account, checking_account, 500);
transaction (checking_account, savings_account, 300);
I am told that this would result in a deadlock. How can I edit this function so that it's completely free of deadlocks?
You need to create a total ordering of objects (Account objects, in this case) and then always lock them in the same order, according to that total ordering. You can decide what order to lock them in, but the simple thing would be to first lock the one that comes first in the total ordering, then the other.
For example, let's say each account has an account number, which is a unique* integer. (* meaning no two accounts have the same number) Then you could always lock the one with the smaller account number first. Using your example:
void transaction (Account from, Account to, double amount)
{
    mutex first_lock, second_lock;

    if (acct_no(from) < acct_no(to))
    {
        first_lock = get_lock(from);
        second_lock = get_lock(to);
    }
    else
    {
        assert(acct_no(to) < acct_no(from)); // total ordering, so == is not possible!
        assert(acct_no(to) != acct_no(from)); // this assert is essentially equivalent

        first_lock = get_lock(to);
        second_lock = get_lock(from);
    }

    acquire(first_lock);
    acquire(second_lock);
    withdraw(from, amount);
    deposit(to, amount);
    release(second_lock);
    release(first_lock);
}
So following this example, if checking_account has account no. 1 and savings_account has account no. 2, transaction (savings_account, checking_account, 500); will lock checking_account first and then savings_account, and transaction (checking_account, savings_account, 300); will also lock checking_account first and then savings_account.
If you don't have account numbers (say you're working with class Foo instead of class Account), then you need to find something else to establish a total ordering. If each object has a name, as a string, then you can do an alphabetic comparison to determine which string is "less". Or you can use any other type that is comparable with > and <.
However, it is very important that the values be unique for each and every object! If two objects have the same value in whichever field you're testing, then they are at the same spot in the ordering. If that can happen, then it is a "partial ordering", not a "total ordering", and it is important to have a total ordering for this locking application.
If necessary, you can make up a "key value" that is an arbitrary number that doesn't mean anything, but is guaranteed unique for each object of that type. Assign a new, unique value to each object when it is created.
Another alternative is to keep all the objects of that type in some kind of list. Then their list position serves to put them in a total ordering. (Frankly, the "key value" approach is better, but some applications may be keeping the objects in a list already for application logic purposes so you can leverage the existing list in that case.) However, take care that you don't end up taking O(n) time (instead of O(1) like the other approaches*) to determine which one comes first in the total ordering when you use this approach.
(* If you're using a string to determine the total ordering, then it's not really O(1), but it's linear in the length of the strings and constant w.r.t. the number of objects that hold those strings... However, depending on your application, the string length may be much more reasonably bounded than the number of objects.)
The problem you are trying to solve is called the dining philosophers problem, a well-known concurrency problem.
In your case the naive solution would be to change acquire to receive 2 parameters (to and from) and only return when it can get both locks at the same time, acquiring neither lock if it can't have both (because that's the situation where the deadlock may occur: you get one lock and wait for the other). Read about the dining philosophers problem and you'll understand why.
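A minimal sketch of that "both locks or none" idea, in Python for brevity (the back-off sleep and the function name are my own choices):

import threading
import time

def acquire_both(lock_a, lock_b):
    # Return only when both locks are held; never hold one while blocking on the other.
    while True:
        lock_a.acquire()
        if lock_b.acquire(blocking=False):
            return                     # got both
        lock_a.release()               # back off so the other transaction can finish
        time.sleep(0.001)

The transaction would then call acquire_both(get_lock(from), get_lock(to)) and release both locks when it is done.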
Hope this helps!
