Alpakka Cassandra Flow Usage - akka-stream

I am reading the documentation of Alpakka Cassandra here.
It makes it very easy to use Cassandra as a source and a sink, but what about flow usage?
By flow usage I mean that I am not using Cassandra as a source or a sink, but to look up data.
Is that possible using Alpakka, or do I have to write the Cassandra driver code in a flow myself?

1) Sink. If you check Alpakka's source code, you'll find that the Sink is constructed as follows
Flow[T]
.mapAsyncUnordered(parallelism)(t ⇒ session.executeAsync(statementBinder(t, statement)).asScala())
.toMat(Sink.ignore)(Keep.right)
If you only need a pass-through flow, you can always trim out the Sink.ignore part, and you'll have
Flow[T]
.mapAsyncUnordered(parallelism)(t ⇒ session.executeAsync(statementBinder(t, statement)).asScala())
You will only need to expose the Guava futures converter, which is currently package private in Alpakka.
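For reference, here is a minimal sketch of such a lookup flow built the same way (Datastax Java driver 3.x types; guavaToScala and lookupFlow are names I made up, since Alpakka's own Guava converter is package private):
import scala.concurrent.{Future, Promise}
import com.google.common.util.concurrent.{FutureCallback, Futures, ListenableFuture, MoreExecutors}
import akka.NotUsed
import akka.stream.scaladsl.Flow
import com.datastax.driver.core.{BoundStatement, PreparedStatement, ResultSet, Session}

// Guava -> Scala future conversion, inlined here because Alpakka's own
// converter is package private.
def guavaToScala[A](lf: ListenableFuture[A]): Future[A] = {
  val p = Promise[A]()
  Futures.addCallback(lf, new FutureCallback[A] {
    def onSuccess(result: A): Unit = p.success(result)
    def onFailure(t: Throwable): Unit = p.failure(t)
  }, MoreExecutors.directExecutor())
  p.future
}

// A lookup Flow built exactly like Alpakka's Sink, minus the Sink.ignore.
def lookupFlow[T](parallelism: Int,
                  statement: PreparedStatement,
                  statementBinder: (T, PreparedStatement) => BoundStatement)
                 (implicit session: Session): Flow[T, ResultSet, NotUsed] =
  Flow[T].mapAsyncUnordered(parallelism) { t =>
    guavaToScala(session.executeAsync(statementBinder(t, statement)))
  }
Downstream you can flatten each ResultSet into individual Rows with mapConcat if you need row-level elements.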
2) Source. You can always obtain a Flow from a Source by doing .flatMapConcat(x => CassandraSource(...))
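For example, something along these lines (a sketch against the pre-2.0 Alpakka Cassandra API; the users_by_id table is made up):
import akka.NotUsed
import akka.stream.alpakka.cassandra.scaladsl.CassandraSource
import akka.stream.scaladsl.Flow
import com.datastax.driver.core.{Row, Session, SimpleStatement}

// Each incoming id becomes a sub-stream of matching rows from a hypothetical lookup table.
def rowsById(implicit session: Session): Flow[String, Row, NotUsed] =
  Flow[String].flatMapConcat { id =>
    CassandraSource(new SimpleStatement("SELECT * FROM users_by_id WHERE id = ?", id))
  }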

Related

Flink filter before partition

Apache Flink uses a DAG-style lazy processing model similar to Apache Spark's (correct me if I'm wrong). That being said, if I use the following code
DataStream<Element> data = ...;
DataStream<Element> res = data.filter(...).keyBy(...).timeWindow(...).apply(...);
.keyBy() converts DataStream to KeyedStream and distributes it among Flink worker nodes.
My question is, how will Flink handle the filter here? Will the filter be applied to the incoming DataStream before partitioning/distributing the stream, so that the resulting DataStream only contains Elements that pass the filter criteria?
Will the filter be applied to the incoming DataStream before partitioning/distributing the stream, so that the resulting DataStream only contains Elements that pass the filter criteria?
Yes, that's right. The only thing I might say differently would be to clarify that the original stream data will typically already be distributed (parallel) from the source. The filtering will be applied in parallel, across multiple tasks, after which the keyBy will then repartition/redistribute the stream among the workers.
You can use Flink's web UI to examine a visualization of the execution graph produced from your job.
From my understanding, the filter is applied before the keyBy. As you said, it is a DAG (D == Directed). Do you see any indicators which tell you that this is not the case?
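For completeness, here is a minimal sketch of the pipeline from the question using Flink's (older) Scala DataStream API, with a placeholder Element type and window size, which you can submit and then inspect in the web UI as suggested above:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Element(key: String, value: Long)

object FilterBeforeKeyBy {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val data: DataStream[Element] =
      env.fromElements(Element("a", 1L), Element("b", -1L), Element("a", 3L))

    val res: DataStream[Element] =
      data
        .filter(_.value > 0)   // runs in parallel in the source tasks, before any shuffle
        .keyBy(_.key)          // only here is the (already filtered) stream repartitioned
        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
        .reduce((a, b) => Element(a.key, a.value + b.value))

    res.print()
    env.execute("filter-before-keyBy")
  }
}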

can I use KSQL to generate processing-time timeouts?

I am trying to use KSQL to do whatever processing I can within a time limit and get the results at that time limit. See Timely (and Stateful) Processing with Apache Beam under "Processing Time Timers" for the same idea illustrated using Apache Beam.
Given:
A stream of transactions with unique keys;
Updates to these transactions in the same stream; and
A downstream processor that wants to receive the updated transactions at a specific timeout - say 20 seconds - after the transactions appeared in the first stream.
Conceptually, I was thinking of creating a KTable of the first stream to hold the latest state of the transactions, and using KSQL to create an output stream by querying the KTable for keys with (create_time + timeout) < current_time. (and adding the timeouts as "updates" to the first stream so I could filter those out from the KTable)
I haven't found a way to do this in the KSQL docs, and even if there were a built-in current_time, I'm not sure it would be evaluated until another record came down the stream.
How can I do this in KSQL? Do I need a custom UDF? If it can't be done in KSQL, can I do it in KStreams?
=====
Update: It looks like KStreams does not support this today - Apache Flink appears to be the way to go for this use case (and many others). If you know of a clever way around KStreams' limitations, tell me!
Take a look at the punctuate() functionality in the Processor API of Kafka Streams, which might be what you are looking for. You can use punctuate() with stream-time (default: event-time) as well as with processing-time (via PunctuationType.WALL_CLOCK_TIME). Here, you would implement a Processor or a Transformer, depending on your needs, which will use punctuate() for the timeout functionality.
See https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html for more information.
Tip: You can use such a Processor/Transformer also in the DSL of Kafka Streams. This means you can keep using the more convenient DSL, if you like to, and only need to plug in the Processor/Transformer at the right place in your DSL-based code.
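As a rough sketch (Kafka Streams 2.x API; the Transaction type, the store name and the one-second punctuation interval are assumptions of mine, not something from the question):
import java.time.Duration
import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.Transformer
import org.apache.kafka.streams.processor.{ProcessorContext, PunctuationType, Punctuator}
import org.apache.kafka.streams.state.KeyValueStore

// Hypothetical transaction value; only createTime matters for the timeout logic.
case class Transaction(createTime: Long, payload: String)

class TimeoutTransformer(storeName: String, timeoutMs: Long)
    extends Transformer[String, Transaction, KeyValue[String, Transaction]] {

  private var ctx: ProcessorContext = _
  private var store: KeyValueStore[String, Transaction] = _

  override def init(context: ProcessorContext): Unit = {
    ctx = context
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, Transaction]]
    // Fire once a second on wall-clock time, independently of incoming records.
    ctx.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, new Punctuator {
      override def punctuate(now: Long): Unit = {
        val it = store.all()
        try {
          while (it.hasNext) {
            val entry = it.next()
            if (now - entry.value.createTime >= timeoutMs) {
              ctx.forward(entry.key, entry.value) // emit the latest state downstream
              store.delete(entry.key)
            }
          }
        } finally it.close()
      }
    })
  }

  // Just remember the latest state per key; emission happens in the punctuator.
  override def transform(key: String, value: Transaction): KeyValue[String, Transaction] = {
    store.put(key, value)
    null
  }

  override def close(): Unit = ()
}
You would register a key-value store under that name on the builder and attach the transformer through the DSL's transform(...) call, passing the store name, so the rest of the topology can stay DSL-based.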

Via/ViaMat/to/toMat in Akka Stream

Can someone explain clearly what the differences between these 4 methods are? When is it more appropriate to use each one? Also, generally speaking, what is the name of this group of methods? Are there more methods that do the same job? A link to the Scaladoc would also help.
All these methods are necessary to join two streams into one stream. For example, you can create a Source out of a Source and a Flow, or you can create a Sink out of a Flow and a Sink, or you can create a Flow out of two Flows.
For this, there are two basic operations, to and via. The former allows one to connect either a Source or a Flow to a Sink, while the latter allows one to connect a Source or a Flow to a Flow:
source.to(sink) -> runnable graph
flow.to(sink) -> sink
source.via(flow) -> source
flow1.via(flow2) -> flow
For reference, a runnable graph is a fully connected reactive stream which is ready to be materialized and executed.
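As a small sketch of how these compose (Akka 2.6 style, where the implicit ActorSystem provides the materializer):
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, RunnableGraph, Sink, Source}

implicit val system: ActorSystem = ActorSystem("example")

val numbers: Source[Int, NotUsed] = Source(1 to 10)
val double: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
val printer: Sink[Int, _] = Sink.foreach[Int](println)

val biggerSource: Source[Int, NotUsed] = numbers.via(double)            // source.via(flow) -> source
val consumer: Sink[Int, NotUsed] = double.to(printer)                   // flow.to(sink)    -> sink
val graph: RunnableGraph[NotUsed] = numbers.via(double).to(printer)     // source + flow + sink -> runnable graph

graph.run() // materialize and execute the fully connected stream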
*Mat versions of various operations allow one to specify how materialized values of streams included in the operation should be combined. As you may know, each stream has a materialized value which can be obtained when the stream is materialized. For example, Source.queue yields a queue object which can be used by another part of your program to emit elements into the running stream.
By default, to and via on sources and flows only keep the materialized value of the stream they are called on, ignoring the materialized value of their argument:
source.to(sink) yields mat.value of source
source.via(flow) yields mat.value of source
flow.to(sink) yields mat.value of flow
flow1.via(flow2) yields mat.value of flow1
Sometimes, however, you need to keep both materialized values or to combine them somehow. That's when the Mat variants of these methods are needed. They allow you to specify a combining function which takes the materialized values of both operands and returns the materialized value of the combined stream:
source.to(sink) equivalent to source.toMat(sink)(Keep.left)
flow1.via(flow2) equivalent to flow1.viaMat(flow2)(Keep.left)
For example, to keep both materialized values, you can use the Keep.both method, or if you only need the mat.value of the "right" operand, you can use the Keep.right method:
source.toMat(sink)(Keep.both) yields a tuple (mat.value of source, mat.value of sink)
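For instance (a sketch; the exact Source.queue signature varies a little between Akka versions, and the implicit ActorSystem from the previous snippet serves as materializer):
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Keep, Sink, Source}

// Source.queue materializes to a SourceQueueWithComplete[Int],
// Sink.foreach to a Future[Done]; Keep.both hands us both.
val (queue, done) =
  Source.queue[Int](bufferSize = 16, OverflowStrategy.backpressure)
    .toMat(Sink.foreach[Int](println))(Keep.both)
    .run()

queue.offer(42)   // push an element into the running stream
queue.complete()  // close the stream; `done` completes once the sink has finished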

Ada + PolyORB : can I have multiple instances of the same RCI unit and be able to call a specific one from code?

I'm developing a distributed application using Ada's Distributed System Annex and PolyORB and I have a somewhat peculiar request.
Let's say I have an RCI (Remote Call Interface) unit called U and the main program called MAIN that uses it.
What I want to know is:
Can I configure DSA to create multiple copies of the partition U?
If the answer is yes, can I then call a specific one of these partitions from my code in MAIN?
I can't find info on this online, right now the only solution I can think of would be to have a pre-processor generate multiple "copies" of U from a generic template and patch the DSA configuration file accordingly. Is there a less "hacky" way to do this?
After reading all the documentation I could find, I conclude that this is not possible, i.e. it's not contemplated in DSA.
I still think the hacky solution I proposed in my original question would work, however it doesn't make much sense from a practical point of view.
For my application, I decided to switch to SOAP, using the AWS web server and ditching DSA altogether.
EDIT: in response to shark8's comment:
My idea, in detail, was this. Let's assume I have some kind of configuration file in which I describe how many copies of the partitions I want and how each one is configured. Here's how I'd do it:
1) Have a pre-processor read the configuration file and generate source files for all the needed partitions from a template (basically this would just create a copy of the files and alter the package name and file name to avoid collisions).
2) Have the pre-processor patch the DSA configuration file accordingly, to generate those partitions.
3) Have one final package that acts as a "gateway", or a "selector" if you want, to the other packages. It would contain "generic" procedures that would act as selectors, something like:
procedure Some_Procedure (partitionID : Integer; params : whatever) is
begin
   case partitionID is
      when 1 =>
         Partition1.Some_Procedure (params);
      when 2 =>
         Partition2.Some_Procedure (params);
      --  ...further alternatives generated by the pre-processor
      when others =>
         null;  --  or raise, for an unknown partition ID
   end case;
end Some_Procedure;
Obviously, like the partitions themselves, the various lines in the case will have to be generated by the pre-processor. Also: the main program that calls this procedure needs to pass its ID, but this is no problem since it will have read the same configuration as the pre-processor did at 1), so it would know the IDs and which one it wants to call, depending on its internal logic.
As I said, it's a tortuous method, but I think it should work.
The DSA (PolyORB) is very flexible and you can have multiple unique RCI units. Why would you want to do this? Is it to maximise processing speed? Having used the DSA quite a bit, I would question your motives for doing it this way and perhaps suggest that you change your design.
If it's processing speed, I would suggest you do your processing in multiple client partitions and use the asynchronous messaging mechanism in the DSA. If you need more processing speed, then do your calculation/processing in multiple stages across client partitions.
There's a great example of this messaging mechanism in the banking example application included in the DSA examples.

Key/Value database for storing binary data

I am looking for a lightweight, reliable and fast key/value database for storing binary data. Simple, without a server. Most popular key/value databases, like CDB and BerkeleyDB, do not natively store BLOBs. What would be the best choice that I have missed?
My current choice is SQLite, but it is too advanced for my simple usage.
As it was previously pointed out, BerkeleyDB does support opaque values and keys, but I would suggest a better alternative: LevelDB.
LevelDB:
Google is your friend :), so much so that they even provide you with an embedded database: A fast and lightweight key/value database library by Google.
Features:
Keys and values are arbitrary byte arrays.
Data is stored sorted by key.
Callers can provide a custom comparison function to override the sort order.
The basic operations are Put(key,value), Get(key), Delete(key).
Multiple changes can be made in one atomic batch.
Users can create a transient snapshot to get a consistent view of data.
Forward and backward iteration is supported over the data.
Data is automatically compressed using the Snappy compression library.
External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
Detailed documentation about how to use the library is included with the source code.
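As a small sketch of those basic operations from the JVM, here is what this could look like using the pure-Java iq80 port of LevelDB (an assumption on my part; the original library is C++, and JNI wrappers such as leveldbjni expose essentially the same calls):
import java.io.File
import org.iq80.leveldb.{DB, Options}
import org.iq80.leveldb.impl.Iq80DBFactory.{bytes, factory}

// Open (or create) a database directory; keys and values are plain byte arrays.
val db: DB = factory.open(new File("example-db"), new Options().createIfMissing(true))
try {
  db.put(bytes("some-key"), bytes("some opaque binary value")) // any byte[] works as the value
  val value: Array[Byte] = db.get(bytes("some-key"))
  db.delete(bytes("some-key"))
} finally {
  db.close()
}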
What makes you think BerkeleyDB cannot store binary data? From their docs:
Key and content arguments are objects described by the datum typedef. A datum specifies a string of dsize bytes pointed to by dptr. Arbitrary binary data, as well as normal text strings, are allowed.
Also see their examples:
money = 122.45;
key.data = &money;
key.size = sizeof(float);
...
ret = my_database->put(my_database, NULL, &key, &data, DB_NOOVERWRITE);
If you don't need "multiple writer processes" (only multiple readers work), want something small, and want something that is available on nearly every Linux, you might want to take a look at gdbm, which is like Berkeley DB but much simpler. It is also possibly not as fast.
In nearly the same area are things like Tokyo Cabinet, QDBM, and the already mentioned LevelDB.
Berkeley DB and SQLite are ahead of those because they support multiple writers. Berkeley DB is a versioning disaster sometimes.
The major pros of gdbm: it's already on every Linux, it has no versioning issues, and it's small.
Which OS are you running? For Windows you might want to check out ESE, which is a very robust storage engine which ships as part of the OS. It powers Active Directory, Exchange, DNS and a few other Microsoft server products, and is used by many 3rd party products (RavenDB comes to mind as a good example).
You may also have a look at RaptorDB, a key/value store: http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2
Using SQLite is now straightforward with the functions readfile(X) and writefile(X,Y), which have been available in the sqlite3 command-line shell since 2014-06. For example, to save a blob in the database from a file and extract it again:
CREATE TABLE images(name TEXT, type TEXT, img BLOB);
INSERT INTO images(name,type,img) VALUES('icon','jpeg',readfile('icon.jpg'));
SELECT writefile('icon-copy2.jpg',img) FROM images WHERE name='icon';
See "File I/O Functions" at https://www.sqlite.org/cli.html.
