Connecting SourceShape/PortOps from a UniformFanOutShape to sources in Akka Streams

Most examples of FanOut shapes in Akka Streams either merge the fanned-out flows back into a single stream, or immediately connect them to a sink. I want instead to return a sequence of Sources to which I can apply transformations and connect to sinks later on. An example would look something like this:
def transformedSources(src: Source[Int, NotUsed]): Seq[Source[Int, NotUsed]] = {
  import GraphDSL.Implicits._
  implicit val builder = new GraphDSL.Builder[NotUsed]
  val bcast = builder.add(Broadcast[Int](2))
  src ~> bcast.in
  val sourceShapes = bcast.outlets.map(out => SourceShape(out))
  ??? // How to convert sourceShapes into Sources?
}
Is there a way to achieve this?
P.S.
My real use case is an AmorphousShape that takes X inputs and produces X outputs. Ideally I would like to apply the AmorphousShape stage and then continue operating as if nothing changed.
So if my code now is
sources.map{s => s.via(someStage).runWith(Sink.seq)}
I would like to transform it into
transformSource(sources).map{s => s.via(someStage).runWith(Sink.seq)}
def transformSource(sources: Seq[Source[Int, NotUsed]]): Seq[Source[Int, NotUsed]] = { /* do some magic with an AmorphousShape graph */ }
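One workaround I know of for this kind of pre-materialization (my own sketch, not part of the original question; the helper name prematerializedSources is hypothetical): drain each fan-out port into Sink.asPublisher when the graph is run, then wrap the materialized Publishers back into ordinary Sources. A minimal sketch for the Broadcast case, assuming Akka 2.6, where an implicit ActorSystem provides the materializer:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, GraphDSL, Keep, RunnableGraph, Sink, Source}

def prematerializedSources(src: Source[Int, NotUsed])(implicit system: ActorSystem): Seq[Source[Int, NotUsed]] = {
  // each output port is drained into a Reactive Streams Publisher when the graph runs
  val pubSink = Sink.asPublisher[Int](fanout = false)
  val (p1, p2) = RunnableGraph.fromGraph(
    GraphDSL.create(pubSink, pubSink)(Keep.both) { implicit builder => (s1, s2) =>
      import GraphDSL.Implicits._
      val bcast = builder.add(Broadcast[Int](2))
      src ~> bcast.in
      bcast.out(0) ~> s1
      bcast.out(1) ~> s2
      ClosedShape
    }
  ).run()
  // wrap the Publishers back into Sources that can be transformed and run later
  Seq(Source.fromPublisher(p1), Source.fromPublisher(p2))
}

Since Sink.asPublisher(fanout = false) back-pressures until its single subscriber attaches, the upstream cannot race ahead before the returned Sources are consumed; the same pattern should extend to an AmorphousShape by materializing one publisher sink per output port.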

Related

Karate: can we make multiple calls from scenario outline using an array variable?

I currently use JUnit 5, WireMock and REST Assured for my integration tests. Karate looks very promising, yet I am struggling a bit with the setup of data-driven tests, as I need to prepare nested data structures which, in the current setup, look like the following:
abstract class StationRequests(val stations: Collection<String>) : ArgumentsProvider {
    override fun provideArguments(context: ExtensionContext): java.util.stream.Stream<out Arguments> {
        val now = LocalDateTime.now()
        val samples = mutableListOf<Arguments>()
        stations.forEach { station ->
            Subscription.values().forEach { subscription ->
                listOf(
                    *Device.values(),
                    null
                ).forEach { device ->
                    Stream.Protocol.values().forEach { protocol ->
                        listOf(
                            null,
                            now.minusMinutes(5),
                            now.minusHours(2),
                            now.minusDays(1)
                        ).forEach { startTime ->
                            samples.add(
                                Arguments.of(
                                    subscription, device, station, protocol, startTime
                                )
                            )
                        }
                    }
                }
            }
        }
        return java.util.stream.Stream.of(*samples.toTypedArray())
    }
}
Is there any preferred way to set up such nested data structures with Karate? I initially thought about defining 5 different arrays with sample values for subscription, device, station, protocol and startTime, and combining and merging them into a single array to be used in the Examples: section.
I have not succeeded so far, though, and I am wondering if there is a better way to prepare such nested data-driven tests?
I don't recommend nesting unless absolutely necessary. You may be able to "flatten" your permutations into a single table, something like this: https://github.com/intuit/karate/issues/661#issue-402624580
That said, look out for the alternate option to Examples: which just might work for your case: https://github.com/intuit/karate#data-driven-features
EDIT: In version 1.3.0, a new @setup life cycle was introduced that changes the example below a bit.
Here's a simple example:
Feature:

Scenario:
* def data = [{ rows: [{a: 1},{a: 2}] }, { rows: [{a: 3},{a: 4}] }]
* call read('called.feature@one') data

and this is called.feature:

@ignore
Feature:

@one
Scenario:
* print 'one:', __loop
* call read('called.feature@two') rows

@two
Scenario:
* print 'two:', __loop
* print 'value of a:', a
This is how it looks in the new HTML report (which is in 0.9.6.RC2 and may need more fine-tuning), and it shows off how Karate can support "nesting" even in the report, which Cucumber cannot do. Maybe you can provide feedback and let us know if it is ready for release :)

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one by one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood", but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure whether they're compatible.
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple of key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party, but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """

    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val)[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0])
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )

Akka Streams: how to handle multiple sources that themselves produce sources

Hi, I have a situation where there is a Source which itself produces Sources. The number of sources that will be produced is not known in advance. Is there a proper design pattern to handle this case? Basically it would look like Source -----> Multiple Sources -------> Sink
EDIT
The scenario for this is as follows.
Create a Source out of a database iterator.
For each database file provided by the above source, transform the file into a Source.
Attach those dynamically created sources to a file IO sink.
Basically I want a bunch of database content to be written to separate files via streams, with back-pressure.
Given a Source of Sources:
type Data = ???
val sos : Source[Source[Data, _], _] = ???
Each of the Data Sources can be drained into individual File Sinks using Source.runForeach.
We first need a function that can generate the Path that you want the data written to:
val pathCreator : () => Path = ???
And a way of converting Data to ByteString:
val dataToByteString : Data => ByteString = ???
These functions can finally be combined to get the behavior you're looking for:
val drainSourceToFile : Source[Data, _] => Future[IOResult] =
  _.map(dataToByteString)
   .runWith(FileIO.toPath(pathCreator()))
sos runForeach drainSourceToFile
If you want all of the IOResult values from FileIO.toPath, so that you can know whether the writing was successful, then you'll need a slightly more complicated setup:
val allIOResults : Future[Seq[IOResult]] =
  sos.map(drainSourceToFile)
     .runWith(Sink.seq)
     .flatMap(Future.sequence(_)) // needs an implicit ExecutionContext in scope
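For concreteness, here is a self-contained assembly of the above (my own sketch, not from the original answer), with Data hypothetically fixed to String and a simple counter-based pathCreator:

import java.nio.file.{Path, Paths}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.stream.IOResult
import akka.stream.scaladsl.{FileIO, Sink, Source}
import akka.util.ByteString

object SourceOfSourcesToFiles extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  import system.dispatcher

  type Data = String
  // a Source of Sources: each inner Source stands in for one database file
  val sos: Source[Source[Data, _], _] =
    Source(List(Source(List("a", "b")), Source(List("c", "d"))))

  val counter = new AtomicInteger(0)
  val pathCreator: () => Path = () => Paths.get(s"out-${counter.getAndIncrement()}.txt")
  val dataToByteString: Data => ByteString = d => ByteString(d + "\n")

  val drainSourceToFile: Source[Data, _] => Future[IOResult] =
    _.map(dataToByteString).runWith(FileIO.toPath(pathCreator()))

  val allIOResults: Future[Seq[IOResult]] =
    sos.map(drainSourceToFile)
       .runWith(Sink.seq)
       .flatMap(Future.sequence(_))

  allIOResults.onComplete { result =>
    println(result)
    system.terminate()
  }
}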

Merging streams when _any_ of the substreams has a value ready

From the Akka Streams documentation, it looks like all stream merging options (merge, mergeSorted, mergePreferred, zipN, zipWithN) work by waiting until all merged streams have a new element ready, then applying the merge strategy (combining elements into a tuple, applying the zip function, etc.).
This works well for offline processing (e.g. reading data from files or HTTP and combining it), but it introduces latency in online processing. I need to merge streams of data produced by e.g. multiple WebSocket connections, and deliver updates in the merged stream as soon as any of the source streams produces a value. Example: if there are source streams A and B, here's what should appear in the merged stream:
Output stream starts with some initial value, e.g. (None, None).
(A:1) (B:<not ready>) -> (Some(1), None)
(A:2) (B:<not ready>) -> (Some(2), None)
(A:3) (B:1) -> (Some(3), Some(1))
(A:3) (B:2) -> (Some(3), Some(2))
etc. Again, a new value appears in the output stream immediately when any of the source streams produces a value.
Is there any combinator to achieve that?
As stated in the comments, Merge and MergePreferred stages do emit elements downstream even if not all upstreams have an element available.
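For instance, a minimal illustration of that behavior (my own sketch, assuming an implicit ActorSystem is in scope for materialization):

import akka.NotUsed
import akka.stream.scaladsl.Source

// merge emits each element downstream as soon as either upstream produces it;
// it does not wait for both sides to have an element ready
val merged: Source[Int, NotUsed] = Source(1 to 3).merge(Source(10 to 12))
merged.runForeach(println)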
From your example it looks like you are looking for zipping sources though. And yes, Zip emits the zipped tuple downstream only when it has elements to zip from all its upstreams. To overcome this you can 'lift' your sources to produce Options, and make them emit None whenever there is nothing else to emit. The source wrapper can look like this:
def asOption[In, Mat](source: Source[In, Mat]): Source[Option[In], Mat] =
  Source.fromGraph(GraphDSL.create(source.map(Option(_))) {
    implicit builder: GraphDSL.Builder[Mat] => src =>
      import GraphDSL.Implicits._
      val noneSource = Source.repeat(None)
      val merge = builder.add(MergePreferred[Option[In]](1))
      src ~> merge.preferred
      noneSource ~> merge.in(0)
      SourceShape(merge.out)
  })
At this point you can zip your sources as you would normally.
val src1: Source[Int, NotUsed] = ???
val src2: Source[Int, NotUsed] = ???
val zipped = asOption(src1) zip asOption(src2)
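To see it run (my addition; assumes an implicit ActorSystem is in scope for materialization):

// each emitted element is an (Option[Int], Option[Int]) pair; None marks
// a side that had nothing ready at that moment
zipped.runForeach(println)

Note that because the None side is fed by Source.repeat, the zip can emit many (None, None) pairs between real values; a conflate or filter stage downstream can thin those out if needed.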

How do I create a Flow with a different input and output types for use inside of a graph?

I am making a custom sink by building a graph on the inside. Here is a broad simplification of my code to demonstrate my question:
def mySink: Sink[Int, Unit] = Sink() { implicit builder =>
  val entrance = builder.add(Flow[Int].buffer(500, OverflowStrategy.backpressure))
  val toString = builder.add(Flow[Int, String, Unit].map(_.toString)) // this is the part that does not compile
  val printSink = builder.add(Sink.foreach(elem => println(elem)))
  builder.addEdge(entrance.out, toString.in)
  builder.addEdge(toString.out, printSink.in)
  entrance.in
}
The problem I am having is that, while it is valid to create a Flow with the same input/output types using only a single type argument and no value argument, like Flow[Int] (which is all over the documentation), it is not valid to supply only two type parameters and zero value parameters.
According to the reference documentation for the Flow object the apply method I am looking for is defined as
def apply[I, O]()(block: (Builder[Unit]) ⇒ (Inlet[I], Outlet[O])): Flow[I, O, Unit]
and says
Creates a Flow by passing a FlowGraph.Builder to the given create function.
The create function is expected to return a pair of Inlet and Outlet which correspond to the created Flow's input and output ports.
It seems like I need to deal with another level of graph builders when I am trying to make what I think is a very simple flow. Is there an easier and more concise way to create a Flow that changes the type of its input and output, one that doesn't require messing with its inside ports? If this is the right way to approach this problem, what would a solution look like?
BONUS: Why is it easy to make a Flow that doesn't change the type of its input from its output?
If you want to specify both the input and the output type of a flow, you indeed need to use the apply method you found in the documentation. Using it, though, is done pretty much exactly the same as you already did.
Flow[String, Message]() { implicit b =>
import FlowGraph.Implicits._
val reverseString = b.add(Flow[String].map[String] { msg => msg.reverse })
val mapStringToMsg = b.add(Flow[String].map[Message]( x => TextMessage.Strict(x)))
// connect the graph
reverseString ~> mapStringToMsg
// expose ports
(reverseString.inlet, mapStringToMsg.outlet)
}
Instead of just returning the inlet, you return a tuple with the inlet and the outlet. This flow can now be used (for instance inside another builder, or directly with runWith) with a specific Source or Sink.
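Worth adding (my own note, based on the plain Flow combinators rather than the graph DSL): for a simple transformation you don't need a builder at all, because any combinator on the identity flow already changes the output type while leaving the input type fixed:

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Flow[Int] is the identity flow Flow[Int, Int, NotUsed]; mapping over it
// yields a Flow[Int, String, NotUsed] without touching any ports
val intToString: Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)

This also answers the bonus question: Flow[Int] is easy precisely because it is the identity flow with a single type to infer, and every stage you append refines the output type from there.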
