Apache Giraph Graph Partitioning: Can a partition p1 reside partially in worker w1 and partially in worker w2?

I am a newbie to Apache Giraph. My question is related to Giraph graph partitioning. As far as I know, Giraph partitions the large graph randomly, possibly with #partitions > #workers in order to load balance. But my question is: is #partitions/worker always an integer? Said another way, can it happen that a partition (say p1) resides partially in worker w1 and partially in worker w2? Or must p1 be either in w1 or in w2 in its entirety?

Partition in Giraph refers to a vertex partition, not a graph partition. For example, if a graph has 10 vertices numbered from 1 to 10, then a possible partitioning would be {1,2,3}, {4,5,6}, {7,8,9,10}. Each partition knows where its outgoing edges are pointing. Each worker creates a thread for each partition assigned to it. The thread iterates over each vertex in the partition and executes the compute function.
So with this information I would say a partition has to reside on a single worker entirely.
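As a rough illustration of that execution model, here is a hypothetical Java sketch (not the actual Giraph API): a worker holds whole partitions, spawns one thread per assigned partition, and each thread iterates over its partition's vertices and calls the compute function.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch (not the real Giraph API): a worker owns whole
// partitions, and each assigned partition gets its own thread, which walks
// the partition's vertices and calls the user-defined compute function.
public class WorkerSketch {

    interface Vertex {
        void compute(List<String> messages);  // user-defined compute function
    }

    // partitionId -> the vertices of that partition (the whole partition lives here)
    private final Map<Integer, List<Vertex>> assignedPartitions;

    WorkerSketch(Map<Integer, List<Vertex>> assignedPartitions) {
        this.assignedPartitions = assignedPartitions;
    }

    void runSuperstep(Map<Integer, List<String>> messagesByPartition) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(assignedPartitions.size());
        for (Map.Entry<Integer, List<Vertex>> entry : assignedPartitions.entrySet()) {
            List<Vertex> partition = entry.getValue();
            List<String> messages = messagesByPartition.getOrDefault(entry.getKey(), List.of());
            pool.submit(() -> {
                for (Vertex v : partition) {
                    v.compute(messages);      // one thread iterates one whole partition
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);  // superstep barrier
    }
}
```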
Hello @zahorak,
If Giraph implements Pregel as-is, then as per the Pregel paper it is not necessary to have #partitions == #workers. The paper says:
The master determines how many partitions the graph will have, and assigns one or more partitions to each worker machine. The number may be controlled by the user. Having more than one partition per worker allows parallelism among the partitions and better load balancing, and will usually improve performance.
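To make that concrete: Giraph's default partitioning is hash-based, and mapping partitions to workers is a separate step, so #partitions does not have to equal #workers. Below is a minimal sketch of the idea (the exact classes and formulas in Giraph may differ); note that a vertex maps to exactly one partition and a partition maps to exactly one worker.

```java
// Sketch of hash-based partitioning with more partitions than workers.
// A vertex maps to exactly one partition, and a partition maps to exactly
// one worker, so a partition never straddles two workers.
public class HashPartitioningSketch {
    public static void main(String[] args) {
        int numPartitions = 12;
        int numWorkers = 4;

        for (long vertexId = 1; vertexId <= 10; vertexId++) {
            int partition = (int) Math.abs(Long.hashCode(vertexId) % numPartitions);
            int worker = partition % numWorkers;   // the whole partition goes to one worker
            System.out.printf("vertex %d -> partition %d -> worker %d%n",
                    vertexId, partition, worker);
        }
    }
}
```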
UPDATE: I found a similar question on the Giraph user mailing list. The answers given in the replies might be helpful. Here is the link to the thread: https://www.mail-archive.com/user@giraph.apache.org/msg01869.html

AFAIK no; actually, I would have said #partitions == #workers.
The reason for partitioning is to handle parts of the graph on one server. After the superstep is executed, messages sent to other partitions are exchanged between the servers within the cluster.
Maybe you understand something different by the term partitioning than I do, but for me partitioning means:
Giraph runs on a cluster with multiple servers; in order to leverage all servers, it needs to partition the graph. It then simply assigns each node randomly to one of the n servers. Out of this you get n partitions, and the nodes within each partition are executed by the one server they were assigned to, and no other.

Related

ClickHouse - How to remove node from cluster for reading?

Background
I'm beginning work to set up a ClickHouse cluster with 3 CH nodes. The first node (Node A) would be write-only, and the remaining 2 (Nodes B + C) would be read-only. By this I mean that writes for a given table to Node A would automatically replicate to Nodes B + C. When querying the cluster, reads would only be resolved against Nodes B + C.
The purpose for doing this is two-fold.
This datastore serves both real-time and background jobs. Both are high volume, but only on the read side, so it makes sense to segment the traffic. Node A would be used for writing to the cluster and for all background reads. Nodes B + C would be used strictly for the UX.
The volume of writes is very low, perhaps 1 write per 10,000 reads. Data is entirely refreshed once per week. Background jobs need to be certain that the most current data is being read before they can be kicked off. Reading off of replicas introduces eventual consistency as a concern, so reading directly from Node A (rather than through the cluster) guarantees the data is strongly consistent.
Question
I'm not finding much specific information in the CH documentation, and am wondering whether this might be possible. If so, what would the cluster configuration look like?
Yes, it is possible to do so. But wouldn't the best solution be to read and write to each server sequentially using the Distributed table?
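If you instead route reads and writes at the application level, a minimal sketch might look like the following. The hostnames, database, and table are hypothetical, and the JDBC URL format depends on the ClickHouse JDBC driver version you have on the classpath; adjust accordingly.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: writes go only to Node A, reads go only to Nodes B/C.
// Hostnames, ports, and the URL format are placeholders; check your
// ClickHouse JDBC driver's documentation for the exact connection string.
public class ReadWriteSplitSketch {
    private static final String WRITE_URL = "jdbc:clickhouse://node-a:8123/default";
    private static final String[] READ_URLS = {
            "jdbc:clickhouse://node-b:8123/default",
            "jdbc:clickhouse://node-c:8123/default"
    };

    static void write(String insertSql) throws SQLException {
        try (Connection c = DriverManager.getConnection(WRITE_URL);
             Statement s = c.createStatement()) {
            s.execute(insertSql);                    // replication fans out to B and C
        }
    }

    static int read(String selectSql) throws SQLException {
        // naive random pick between the read-only replicas
        String url = READ_URLS[ThreadLocalRandom.current().nextInt(READ_URLS.length)];
        try (Connection c = DriverManager.getConnection(url);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(selectSql)) {
            int rows = 0;
            while (rs.next()) {
                rows++;                              // process each row here
            }
            return rows;
        }
    }
}
```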

How does Multi-Raft group nodes together?

I am trying to implement an architecture similar to CockroachDB's Multi-Raft: https://www.cockroachlabs.com/blog/scaling-Raft/.
Does anyone have a brief explanation of how Multi-Raft groups these individual Raft clusters? Specifically, is there a Raft instance coordinating the membership of the servers participating in each range/session/unit of the smaller Raft units?
If I had to guess, they implement logic similar to rendezvous hashing (or consistent hashing).
For example, say there are 10 nodes in total and there is a range X. They could use hashing to decide which 3 nodes are responsible for that range. One of those nodes will be the leader for the given range, and for another range the set of nodes will be different.
Since there are many more ranges than nodes, every node will participate in multiple ranges. This is cool, because when a node sends a heartbeat to a follower, that heartbeat confirms all ranges where the given node is the leader and the other node is a follower.
At the end of the day, every node still sends a heartbeat to every other node.
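To make the hashing idea concrete, here is a minimal rendezvous (highest-random-weight) hashing sketch; node and range names are made up, and CockroachDB's actual placement logic is more involved. For each range, every node gets a score derived from hashing the (range, node) pair, and the top three scorers form that range's Raft group.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Minimal rendezvous hashing sketch: for a given range, every node gets a
// score = hash(range, node); the top-3 scoring nodes form that range's Raft
// group. Different ranges yield different top-3 sets, so every node ends up
// participating in many ranges.
public class RendezvousSketch {
    static long score(String rangeId, String nodeId) {
        // any stable mixing function works; this one is just for illustration
        long h = (rangeId + "|" + nodeId).hashCode();
        h ^= (h << 21);
        h ^= (h >>> 35);
        h ^= (h << 4);
        return h;
    }

    static List<String> replicasFor(String rangeId, List<String> nodes, int replicas) {
        return nodes.stream()
                .sorted(Comparator.comparingLong((String n) -> score(rangeId, n)).reversed())
                .limit(replicas)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("n1", "n2", "n3", "n4", "n5",
                                     "n6", "n7", "n8", "n9", "n10");
        System.out.println("range-X -> " + replicasFor("range-X", nodes, 3));
        System.out.println("range-Y -> " + replicasFor("range-Y", nodes, 3));
    }
}
```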

How Row Key is designed in Hbase

I am writing a program that converts an RDBMS into HBase. I selected a sequential entity as the row key, like Employee ID (1, 2, 3, ...), but I read somewhere that the row key shouldn't be a sequential entity. My question is: why is selecting a sequential row key not recommended, and what are the design considerations associated with doing so?
Although sequential row keys allow faster scans, they become a problem after a certain point because they cause undesirable RegionServer hotspotting at read/write time. By default, HBase stores rows with similar keys in the same region, which allows faster range scans. So if row keys are sequential, all of your data will start going to the same machine, causing uneven load on that machine. This is called RegionServer hotspotting and is the main motivation behind not using sequential keys. I'll take "writes" to explain the problem here.
When records with sequential keys are being written to HBase, all writes hit one Region. This would not be a problem if a Region were served by multiple RegionServers, but that is not the case: each Region lives on just one RegionServer. Each Region has a pre-defined maximum size, so after a Region reaches that size it is split into two smaller Regions. Following that, one of these new Regions takes all new records, and then this Region and the RegionServer that serves it become the new hotspot victim. Obviously, this uneven write-load distribution is highly undesirable because it limits the write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster.
You can find a very good explanation of the problem along with its solution here.
You might also find this page helpful, which shows us how to design rowkeys efficiently.
Hope this answers your question.
Mostly because sequentially increasing row keys will all be written to the same region, rather than being evenly distributed in terms of writes. If you have a write-intensive application, it makes sense to have some randomness in your row key.
This is a great explanation (with graphics) on why a sequentially increasing row-key is a bad idea for HBase.
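One common way to add that randomness without giving up the ability to fetch a row by its ID is salting: prefix the sequential key with a small bucket derived from the ID so that consecutive IDs land in different regions. A minimal sketch (the bucket count and key layout are just an example, not a prescribed scheme):

```java
// Sketch of a salted row key: a small bucket prefix derived from the ID
// spreads consecutive IDs (1, 2, 3, ...) across regions, while the original
// ID stays in the key so a single row can still be fetched directly.
public class SaltedKeySketch {
    private static final int SALT_BUCKETS = 16;   // example value

    static String saltedKey(long employeeId) {
        int bucket = (int) (Long.hashCode(employeeId) % SALT_BUCKETS);
        if (bucket < 0) bucket += SALT_BUCKETS;
        // zero-padded so keys still sort cleanly within a bucket
        return String.format("%02d-%012d", bucket, employeeId);
    }

    public static void main(String[] args) {
        for (long id = 1; id <= 5; id++) {
            System.out.println(id + " -> " + saltedKey(id));
        }
    }
}
```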

Multiple instances of a service processing rows from a single table, what's the best way to prevent collision

We are trying to build a system with multiple instances of a service on different machines that share the load of processing.
Each of these will check a table; if there are rows to be processed in that table, it will pick the first, mark it as processing, process it, then mark it as done. Rinse and repeat.
What is the best way to prevent a race condition where two instances A and B do the following:
A (1) reads the table, finds row 1 to process,
B (1) reads the table, finds row 1 to process,
A (2) marks row 1 as processing,
B (2) marks row 1 as processing.
In a single app we could use locks or mutexes.
I could just put A (1) and A (2) in a single transaction. Is it that simple, or is there a better, faster way to do this?
Should I just turn it on its head so that the steps are:
A (1) Mark the next row as mine to process
A (2) Return it to me for processing.
I figure this has to have been solved many times before, so I'm looking for the "standard" solutions, and if there are more than one, benefits and disadvantages.
Transactions are a nice simple answer, with two possible drawbacks:
1) You might want to check the fine print of your database. Sometimes the default consistency settings don't guarantee absolute consistency in every possible circumstance.
2) Sometimes the pattern of accesses associated with using a database to queue and distribute work is hard on a database that isn't expecting it.
One possibility is to look at reliable message-queuing systems, which seem to be a pretty good match for what you are looking for: worker machines could just read work from a shared queue. Possible jumping-off points are http://en.wikipedia.org/wiki/Message_queue and http://docs.oracle.com/cd/B10500_01/appdev.920/a96587/qintro.htm
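For the "turn it on its head" variant from the question, the usual trick with plain transactions is to make the claim itself a single atomic UPDATE and check how many rows it affected. The sketch below assumes a hypothetical jobs table with id, status, and owner columns; it is one way to do it, not the only one.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of an optimistic claim: the UPDATE only succeeds if the row is still
// 'pending', so if two instances race, exactly one sees updateCount == 1 and
// the other moves on to the next candidate row. Table and column names are
// hypothetical.
public class RowClaimSketch {

    /** Returns true if this instance won the race for the given row. */
    static boolean tryClaim(Connection conn, long rowId, String instanceId) throws SQLException {
        String sql = "UPDATE jobs SET status = 'processing', owner = ? "
                   + "WHERE id = ? AND status = 'pending'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, instanceId);
            ps.setLong(2, rowId);
            return ps.executeUpdate() == 1;   // 0 means someone else claimed it first
        }
    }
}
```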

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? I.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
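A small sketch of why hashing the sequence number helps: consecutive sequence numbers produce key prefixes scattered across the lexical key space, so consecutive inserts land on different tablet servers rather than piling up at one end of an index. The hashing and key layout below are illustrative, not the exact scheme from the talk or from fork_join_queue.py.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: hashing an incrementing sequence number yields key
// prefixes spread across the lexical row space, instead of clustering at one
// end of the index the way add_time-ordered keys would.
public class WorkIndexSketch {
    static String workIndex(long sequenceNumber) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(Long.toString(sequenceNumber).getBytes(StandardCharsets.UTF_8));
        // use the first 4 bytes of the hash as a hex prefix, and keep the
        // sequence number so the original ordering is still recoverable
        return String.format("%02x%02x%02x%02x-%d",
                digest[0] & 0xff, digest[1] & 0xff, digest[2] & 0xff, digest[3] & 0xff,
                sequenceNumber);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (long seq = 1; seq <= 5; seq++) {
            System.out.println(seq + " -> " + workIndex(seq));
        }
    }
}
```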
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!
