Akka Streams: using the Partition stage

I am trying to find an example that shows how to use the Akka Streams Partition stage. I am trying to optimize writes to a store: I would like to group items into batches of a size of my choosing and write each batch as a single update, and to make use of parallelism I would like those writes to happen concurrently, say on 8 simultaneous threads.
So I would be able to write 30 records as one update, on 8 parallel threads.
grouped and groupedWithin are well documented and I have tried them with success. However, the only way I can see to do the writes in parallel is to partition the stream. I'd like to partition randomly; I don't care how the elements are partitioned. If there is a facility such as a round-robin partitioner, I'd like to know about it as well.
But first and foremost, how do I use Partition? Do I have to build the graph with a GraphDSL builder?
Please help

A usage example of Partition can be found in the Akka Streams documentation.
If you're looking for random partitioning, the Balance stage is what you need; the docs provide an example of that as well.
In both cases you need to make use of the GraphDSL, which is also documented there.
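To make that concrete, here is a minimal sketch (my own, assuming the Akka 2.5.x Scala DSL, not taken from the docs) of a graph that uses Balance to spread records over 8 lanes, batches up to 30 records per lane with groupedWithin, and writes each batch as one update. The Record type and the writeBatch function are placeholders for your store's API.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl._

import scala.concurrent.Future
import scala.concurrent.duration._

object BalancedBatchWrites extends App {
  implicit val system: ActorSystem = ActorSystem("balanced-batch-writes")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  final case class Record(id: Int)

  // Placeholder for the real store call: persists one batch as a single update.
  def writeBatch(batch: Seq[Record]): Future[Unit] =
    Future { println(s"writing ${batch.size} records") }

  val records: Source[Record, NotUsed] = Source(1 to 100000).map(Record(_))

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._

    // Balance hands each record to whichever of its 8 outputs is ready first,
    // which gives the "I don't care how it is partitioned" behaviour.
    val balance = builder.add(Balance[Record](8))
    val merge   = builder.add(Merge[Unit](8))

    records ~> balance
    for (i <- 0 until 8) {
      // Each lane batches up to 30 records (or whatever arrived within 1 second)
      // and writes the whole batch as one update before pulling more.
      balance.out(i) ~> Flow[Record]
        .groupedWithin(30, 1.second)
        .mapAsync(1)(writeBatch) ~> merge.in(i)
    }
    merge ~> Sink.ignore
    ClosedShape
  })

  graph.run()
}
```

If the partitioning really can be arbitrary, note that you can get the same effect without the GraphDSL at all, e.g. records.groupedWithin(30, 1.second).mapAsyncUnordered(8)(writeBatch), which batches first and then runs up to 8 writes concurrently.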

Related

Concurrent queries in PostgreSQL - what is actually happening?

Let us say we have two users running a query against the same table in PostgreSQL:
User 1: SELECT * FROM table WHERE year = '2020'
User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time, or one after the other?
I would expect that if I have 2 processors I can run both at the same time, but I suspect matters become far more complicated given that it is the same table: where the data is located (e.g. on disk), whether there is partitioning, configuration, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get it, and under which circumstances will I not?
EDIT: I have found this other question which is very close to what I was asking: https://dba.stackexchange.com/questions/72325/postgresql-if-i-run-multiple-queries-concurrently-under-what-circumstances-wo. It is a bit old and doesn't have many answers; I would appreciate a fresh outlook on it.
If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
"…how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned?"
You should always get the correct answer to your query (or possibly some kind of ERROR indicating a failure to serialize if you are using the highest, non-default isolation level, but that doesn't seem to be a risk if each of those queries is run in a single-statement transaction).
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.

Couchbase retrieve data from vbucket

I'm new to Couchbase and wondering if there is any way to implement a parallel read from a bucket. A bucket contains 1024 vbuckets by default, so would it be possible to split an N1QL query such as select * from b1 into several queries, where one of those queries reads data only from vbucket1 to vbucket100? Since the partition key is used to decide which node a value is persisted on, I think it should be possible to read part of the data in a bucket according to a range of partition keys. Could someone help me out with this?
Thanks
I don't recommend proceeding down this route. If you are just starting out, you should be worrying about how to represent your data in JSON, how to write effective N1QL queries against it, and how to get a useful set of indexes that support those queries and let them run quickly. You should also make sure that your cluster is properly set up, and you have a proper mix of KV, N1QL, and indexing nodes, with none of them as an obvious bottleneck. And of course you should be measuring performance. Exotic strategies like query partitioning should come after that, if you are still unsatisfied with performance.

Unexpected scheduler behaviour

In a simple workflow, which is as far as I can tell embarrassingly parallel (please correct me), I observe a strange order of execution by the dask.distributed (and presumably the multiprocessing) scheduler.
To illustrate the process, I have set up a similar problem with 5 instead of 60000 partitions, which yields the following dask graph (image omitted); the yellow boxes are 'from_delayed' in the real case.
The underlying workflow is as follows:
Read in data
Merge the resulting dask dataframe with a pandas dataframe. As far as I can tell from the dask documentation, this should be a "Fast and common case".
Select data based on the result
Apply a function on each partition
To my surprise, all the data is read in during the first step, consuming approximately 450 GB out of 500 GB of memory. After this, the lambda function is applied in parallel, but it does not use all workers; apparently only 2-3 are active at a time. Maybe the scheduler hesitates to use more cores because memory is almost full? When I run the code on smaller problems (~400 partitions), the data is still loaded into memory completely, but afterwards the execution uses all available cores. I've tried to use repartition in the very large case, but it did not seem to have an impact.
I investigated this using the distributed scheduler, but it seems to be the same for the multithreaded scheduler.
Maybe the 'merge'-step causes the bottleneck? Are there any other obvious bottlenecks in the given graph?

Alternative to map in marklogic

In my application I need to feed billions of entries into a map, which takes a lot of execution time. Is there an alternative to map that takes less execution time?
With billions of anything, the answer is probably to put them into the database and work with them using range indexes. It might be appropriate to use https://github.com/marklogic/semantic/ or at least borrow some of its concepts.
I concur with Justin that more details would help to give a more accurate answer.
In general the main issue with map:map is that it needs to be initialized at each execution of the module. With many entries it is wise to store the map:map entirely somewhere, to speed this up. You could put it in a server-field, so it would only need to be recalculated after a restart. You could also store it in a database, but that would require a database round-trip to retrieve it.
However, a map:map with a billion entries might not perform well at all. As an alternative you could store each entry as a separate document in the database; MarkLogic can handle that very well. You can use cts functions to retrieve the appropriate entries. Indexes are kept in memory, which makes using them very fast.
HTH!

app engine data pipelines talk - for fan-in materialized view, why are work indexes necessary?

I'm trying to understand the data pipelines talk presented at google i/o:
http://www.youtube.com/watch?v=zSDC_TU7rtc
I don't see why fan-in work indexes are necessary if I'm just going to batch through input-sequence markers.
Can't the optimistically-enqueued task grab all unapplied markers, churn through as many of them as possible (repeatedly fetching a batch of, say, 10, then transactionally updating the materialized view entity), and re-enqueue itself if the task times out before working through all markers?
Do the work indexes have something to do with the efficiency of querying for all unapplied markers? i.e., is it better to query for "markers with work_index = " than for "markers with applied = False"? If so, why is that?
For reference, the question+answer which led me to the data pipelines talk is here:
app engine datastore: model for progressively updated terrain height map
A few things:
My approach assumes multiple workers (see ShardedForkJoinQueue here: http://code.google.com/p/pubsubhubbub/source/browse/trunk/hub/fork_join_queue.py), where the inbound rate of tasks exceeds the amount of work a single thread can do. With that in mind, how would you use a simple "applied = False" to split work across N threads? Probably assign another field on your model to a worker's shard_number at random; then your query would be on "shard_number=N AND applied=False" (requiring another composite index). Okay that should work.
But then how do you know how many worker shards/threads you need? With the approach above you need to statically configure them so your shard_number parameter is between 1 and N. You can only have one thread querying for each shard_number at a time or else you have contention. I want the system to figure out the shard/thread count at runtime. My approach batches work together into reasonably sized chunks (like the 10 items) and then enqueues a continuation task to take care of the rest. Using query cursors I know that each continuation will not overlap the last thread's, so there's no contention. This gives me a dynamic number of threads working in parallel on the same shard's work items.
Now say your queue backs up. How do you ensure the oldest work items are processed first? Put another way: How do you prevent starvation? You could assign another field on your model to the time of insertion-- call it add_time. Now your query would be "shard_number=N AND applied=False ORDER BY add_time DESC". This works fine for low throughput queues.
What if your work item write-rate goes up a ton? You're going to be writing many, many rows with roughly the same add_time. This requires a Bigtable row prefix for your entities as something like "shard_number=1|applied=False|add_time=2010-06-24T9:15:22". That means every work item insert is hitting the same Bigtable tablet server, the server that's currently owner of the lexical head of the descending index. So fundamentally you're limited to the throughput of a single machine for each work shard's Datastore writes.
With my approach, your only Bigtable index row is prefixed by the hash of the incrementing work sequence number. This work_index value is scattered across the lexical rowspace of Bigtable each time the sequence number is incremented. Thus, each sequential work item enqueue will likely go to a different tablet server (given enough data), spreading the load of my queue beyond a single machine. With this approach the write-rate should effectively be bound only by the number of physical Bigtable machines in a cluster.
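To make the row-key argument concrete, here is a toy sketch of my own (in Scala rather than the Python of fork_join_queue.py, and not using any App Engine APIs) that just prints the two key shapes: monotonic add_time-style keys that sort next to each other, versus keys prefixed by a hash of the incrementing sequence number, which scatter across the lexical key space.

```scala
import java.security.MessageDigest

object WorkIndexSketch extends App {
  def sha1Hex(s: String): String =
    MessageDigest.getInstance("SHA-1")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  val sequenceNumbers = 1 to 5

  // Timestamp-prefixed keys: consecutive inserts share a long prefix and land
  // in the same lexical region (i.e. on the same tablet server).
  sequenceNumbers.foreach { n =>
    println(f"shard_number=1|applied=False|add_time=2010-06-24T09:15:$n%02d")
  }

  // Hash-prefixed work_index keys: each increment of the sequence number maps
  // to an unrelated prefix, spreading inserts across the key space.
  sequenceNumbers.foreach { n =>
    println(s"work_index=${sha1Hex(n.toString).take(8)}|seq=$n")
  }
}
```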
One disadvantage of this approach is that it requires an extra write: you have to flip the flag on the original marker entity when you've completed the update, which is something Brett's original approach doesn't require.
You still need some sort of work index, too, or you encounter the race conditions Brett talked about, where the task that should apply an update runs before the update transaction has committed. In your system, the update would still get applied - but it could be an arbitrary amount of time before the next update runs and applies it.
Still, I'm not the expert on this (yet ;). I've forwarded your question to Brett, and I'll let you know what he says - I'm curious as to his answer, too!
