Akka Streams: how to handle a source that itself produces sources - akka-stream

Hi, I have a situation where there is a Source which itself produces Sources. The number of sources that will be produced is not known in advance. Is there a proper design pattern to handle this case? Basically it would look like Source -----> Multiple Sources -------> Sink
EDIT
The scenario for this is as follows.
Create a Source out of a database iterator.
For each database file provided by the above source, transform the file into a Source.
Attach those dynamically created Sources to a file IO Sink.
Basically I want a bunch of database content to be written to separate files via streams, with backpressure.

Given a Source of Sources:
type Data = ???
val sos : Source[Source[Data, _], _] = ???
Each of the Data Sources can be drained into individual File Sinks using Source.runForeach.
We first need a function that can generate the Path that you want the data written to:
val pathCreator : () => Path = ???
And a way of converting Data to ByteString:
val dataToByteString : Data => ByteString = ???
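For illustration only, here is a minimal sketch of what those two functions could look like; the temp-file naming and the toString-based encoding are assumptions, not something prescribed above:
import java.nio.file.{Files, Path}
import akka.util.ByteString
// Hypothetical implementations, purely for illustration.
val pathCreator : () => Path =
  () => Files.createTempFile("db-export-", ".dat")   // a fresh target file per inner Source
val dataToByteString : Data => ByteString =
  data => ByteString(data.toString + "\n")            // assumes Data has a meaningful toString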
These functions can finally be combined to get the behavior you're looking for:
val drainSourceToFile : Source[Data, _] => Future[IOResult] =
  _.map(dataToByteString)
   .runWith(FileIO.toPath(pathCreator()))

sos runForeach drainSourceToFile
If you want all of the IOResult values from FileIO.toPath so that you can know whether the writing was successful then you'll need a slightly more complicated setup:
val allIOResults : Future[Seq[IOResult]] =
  sos.map(drainSourceToFile)
     .runWith(Sink.seq)
     .flatMap(Future.sequence(_))
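To tie this back to the database scenario in the question, the surrounding plumbing might look roughly like the sketch below. Source.fromIterator, the implicit ActorSystem (which also provides the materializer on Akka 2.6+) and the ExecutionContext are standard Akka pieces; DbFile, databaseFiles and rowsOf are hypothetical stand-ins for whatever your database layer actually provides:
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.ExecutionContext
implicit val system : ActorSystem = ActorSystem("db-export")
implicit val ec : ExecutionContext = system.dispatcher   // needed by flatMap / Future.sequence above
// Hypothetical database accessors -- substitute your own types and calls.
type DbFile = ???
def databaseFiles() : Iterator[DbFile] = ???
def rowsOf(file: DbFile) : Iterator[Data] = ???
// One inner Source per database file, wrapped in an outer Source.
val sos : Source[Source[Data, _], _] =
  Source.fromIterator(() => databaseFiles())
    .map(file => Source.fromIterator(() => rowsOf(file)))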

Related

Is there a file object to get path or name of a file in Nim?

Let's say, I would like to use a single object to represent a file and I'd like to get the filename (or path) of it so that I can use the name to remove the file or for other standard library procedures. I'd like to have a single abstraction which can be used with all available file-related standard library procedures.
I've found FileInfo but in my research I didn't find a get-file-name-procedure. File and FileHandle are pretty useless from a software engineering point of view because they provide no convenient abstraction and don't have members.
Is there a file abstraction (object) in Nim, which provides fast access to FileInfo as well as the file name so that a file doesn't need more than one procedure parameter?
There is no such abstraction in Nim, or any other language, simply because you are asking for something that is impossible with most filesystems. Consider the FileInfo structure and its linkCount field, which tells you the number of hard links the file object has. But there is no way to get a filename from one or all of those links short of building and maintaining your own database of the whole filesystem.
While most filesystems allow access to files through paths, there is rarely a filesystem that gives paths from files, because they actually don't need one! An example would be a Unix filesystem where one process opens a file through a path, then removes the path without closing the file. While the process holding the file open is alive, that file won't actually disappear, so you would have the case of a file without a path.
The issue of handling paths, especially considering cross-platform applications, opens its own can of worms: if you store paths as strings, what is the path separator and how do you escape it? Does your filesystem support volumes that require special-case handling? What string encoding do paths use to satisfy all users? Just the encoding issue requires tons of tables and conversions which would bog down every other API wishing to get just a file-like handle to read or write bytes.
A FileInfo is just a snapshot of the state of the file at a given time, a file handle is the live file object you can operate on, and a path (or many paths if your filesystem supports hard links) is just a convenience name for end users.
These are all very different things, which is why they are separate. Your app may need a more complex abstraction than other programmers are willing to tolerate, so create your own abstraction which holds together all the individual pieces you need. For instance, consider the following structure:
import os

type
  AppFileInfo = object
    fileInfo: FileInfo
    file: File
    oneOfMany: string

proc changeFileExt(appFileInfo: AppFileInfo, ext: string): string =
  changeFileExt(appFileInfo.oneOfMany, ext)

proc readAll(appFileInfo: AppFileInfo): string =
  readAll(appFileInfo.file)
Those procs simply mimic the respective standard library APIs but use your more complex structure as inputs and transform it as needed. If you are worried about this abstraction not being optimised due to the extra proc call you could use a template instead.
If you follow this route, however, at some point you will have to ask yourself what is the lifetime of an AppFileInfo object: do you create it with a path? Do you create it from a file handle? Is it safe to access the file field in parts of your code or has it not been initialised properly? Do you return errors or throw exceptions when something goes wrong? Maybe when you start to ask yourself these questions you'll realise they are very app specific and are very difficult to generalise for every use case. Therefore such a complex object doesn't make much sense in the language standard library.
I created the missing solution myself. I basically extended the File type using a global encapsulated table. Extending types like this could be a useful idiom in Nim because of UFCS.
import tables

type FileObject = object
  file: File
  mode: FileMode
  path: string

proc initFileObject(name: string; mode: FileMode; bufsize = -1): FileObject =
  result.file = open(name, mode, bufsize)
  result.path = name
  result.mode = mode

var g_fileObjects = initTable[File, FileObject]()

template get(this: File): var FileObject = g_fileObjects[this]

proc openFile*(filepath: string; mode: FileMode = fmRead; bufsize = -1): File =
  var fileObject = initFileObject(filepath, mode, bufsize)
  result = fileObject.file
  g_fileObjects[result] = fileObject

proc filePath*(this: File): string {.raises: [KeyError].} =
  return this.get.path

proc fileMode*(this: File): FileMode {.raises: [KeyError].} =
  return this.get.mode

from os import tryRemoveFile

proc closeOrDeleteFile[delete = false](this: File): bool =
  result = g_fileObjects.hasKey(this)
  if result:
    when delete:
      result = this.filePath.tryRemoveFile()
    g_fileObjects.del(this)
    this.close()

proc closeFile*(this: File): bool = this.closeOrDeleteFile[:false]
proc deleteFile*(this: File): bool = this.closeOrDeleteFile[:true]
Now you can write
var f = openFile("myFile.txt", fmWrite)
var g = openFile("hello.txt", fmWrite)
echo f.filePath
echo f.deleteFile()
g.writeLine(g.filePath)
echo g.closeFile()

Send XLSX file as mail attachment via ABAP

I have to create an email and attach an XLSX file. I looked at the BCS_EXAMPLE_7 program.
I have transformed the content with the following method:
TRY.
    cl_bcs_convert=>string_to_solix(
      EXPORTING
        iv_string   = lv_content
        iv_codepage = '4103'
        iv_add_bom  = 'X'
      IMPORTING
        et_solix    = pt_binary_content
        ev_size     = pv_size ).
  CATCH cx_bcs.
    ls_return-type    = text-023.
    ls_return-message = text-024.
    APPEND ls_return TO pt_return.
ENDTRY.

CONCATENATE lv_save_file_name '_' sy-datum '.xlsx' INTO lv_save_file_name.
lv_attachment_subject = lv_save_file_name.
CONCATENATE '&SO_FILENAME=' lv_attachment_subject INTO ls_attachment_header.
APPEND ls_attachment_header TO lt_attachment_header.

lo_document->add_attachment( i_attachment_type    = 'XLS'
                             i_attachment_subject = lv_attachment_subject
                             i_attachment_size    = pv_size
                             i_att_content_hex    = pt_binary_content
                             i_attachment_header  = lt_attachment_header ).
The email is sent correctly but when I open the attachment I see the error
Cannot open the file because the file extension is incorrect
Could you help me? thanks
That's normal behavior of Excel, unrelated to ABAP, when the file name has the extension .xlsx but the file doesn't contain data in the corresponding XLSX format. Excel does the same kind of check for other extensions. If you need more information about these checks, please search the Web.
As I see that your program creates the attachment from text converted into the UTF-16LE code page (SAP code page 4103), I guess that you created the Excel data in CSV, tab-separated values or even the old Excel XMLSS/XML 2003 format.
In that case, the extension .xlsx is not valid; to avoid the message, use the adequate extension, respectively .csv, .txt or .xml.
If you really need the extension .xlsx for some reason, then you must create the data in XLSX format. You may use the free API abap2xlsx. If you need further assistance about how to use abap2xlsx, please ask a new question (unrelated to email).
NB: maybe you were told to use the extension .xlsx although there is no real need for it (each format has its own features, but simple unformatted values can be achieved with all of them); in that case you may propose a simpler format like CSV or tab-separated values.
NB: you may also hit the opposite case, where Excel sniffs that the file contains data in XLSX format but the file name doesn't have the .xlsx extension (and likewise for the other formats); I can't say exactly how Excel reacts in each case.
It appears that whatever you have in lv_content isn't actually a valid Excel file. You cannot just take arbitrary data, give it the extension .xlsx and expect MS Excel to know what to do with it.
Unfortunately, creating valid MS Office files is anything but trivial. It's a format which is theoretically open and based on XML (actually a zip archive containing multiple XML files), but in practice the specification is over 5000(!) pages long.
Fortunately, there is a library for that. abap2xlsx is an open source (Apache License) library which provides an easy API to create (and read) valid XLSX files in ABAP.
You could also try to open the file with a text editor (e.g. Notepad++); maybe this gives a hint about the actual content.
But I guess that something went wrong generating the binary table. Maybe you are using the wrong file size or code page.
Possible problems:
First problem: as Sandra correctly said, you may have invalid content in your lv_content variable which doesn't correspond to a correct XLSX structure.
Second problem (which you already solved, as seen from your code): BCS classes do not support 4-character extensions.
Here is a sample of how to build and send a correct XLSX file via mail:
SELECT * UP TO 100 ROWS
  FROM spfli
  INTO TABLE @DATA(lt_spfli).

cl_salv_table=>factory( IMPORTING r_salv_table = DATA(lr_table)
                        CHANGING  t_table      = lt_spfli ).

DATA: lr_xldimension TYPE REF TO if_ixml_node,
      lr_xlworksheet TYPE REF TO if_ixml_element.

DATA(lv_xlsx) = lr_table->to_xml( if_salv_bs_xml=>c_type_xlsx ).

DATA(lr_zip) = NEW cl_abap_zip( ).
lr_zip->load( lv_xlsx ).
lr_zip->get( EXPORTING name = 'xl/worksheets/sheet1.xml' IMPORTING content = DATA(lv_file) ).

DATA(lr_file) = NEW cl_xml_document( ).
lr_file->parse_xstring( lv_file ).

* Row elements are under SheetData
DATA(lr_xlnode) = lr_file->find_node( 'sheetData' ).
DATA(lr_xlrows) = lr_xlnode->get_children( ).

* Create new element in the XML file
lr_xlworksheet ?= lr_file->find_node( 'worksheet' ).
DATA(lr_xlsheetpr)   = cl_ixml=>create( )->create_document( )->create_element( name = 'sheetPr' ).
DATA(lr_xloutlinepr) = cl_ixml=>create( )->create_document( )->create_element( name = 'outlinePr' ).
lr_xlsheetpr->if_ixml_node~append_child( lr_xloutlinepr ).
lr_xloutlinepr->set_attribute( name = 'summaryBelow' value = 'false' ).
lr_xldimension ?= lr_file->find_node( 'dimension' ).
lr_xlworksheet->if_ixml_node~insert_child( new_child = lr_xlsheetpr ref_child = lr_xldimension ).

* Create xstring and move it to XLSX
lr_file->render_2_xstring( IMPORTING stream = lv_file ).
lr_zip->delete( EXPORTING name = 'xl/worksheets/sheet1.xml' ).
lr_zip->add( EXPORTING name = 'xl/worksheets/sheet1.xml' content = lv_file ).
lv_xlsx = lr_zip->save( ).

DATA lv_size   TYPE i.
DATA lt_bintab TYPE solix_tab.

* Convert to binary
CALL FUNCTION 'SCMS_XSTRING_TO_BINARY'
  EXPORTING
    buffer        = lv_xlsx
  IMPORTING
    output_length = lv_size
  TABLES
    binary_tab    = lt_bintab.

DATA main_text TYPE bcsy_text.

* create persistent send request
DATA(send_request) = cl_bcs=>create_persistent( ).

* create document object from internal table with text
APPEND 'Valid Excel file' TO main_text.
DATA(document) = cl_document_bcs=>create_document( i_type    = 'RAW'
                                                   i_text    = main_text
                                                   i_subject = 'Test Created for stella' ).

DATA lt_att_head TYPE soli_tab.
APPEND '&SO_FILENAME=MySheet.xlsx' TO lt_att_head.

* add the spreadsheet as attachment to document object
document->add_attachment( i_attachment_type    = 'xls'
                          i_attachment_subject = 'MySheet'
                          i_attachment_size    = CONV so_obj_len( lv_size )
                          i_attachment_header  = lt_att_head
                          i_att_content_hex    = lt_bintab ).

send_request->set_document( document ).

DATA(recipient) = cl_cam_address_bcs=>create_internet_address( 'some_recipient@mail.com' ).
send_request->add_recipient( recipient ).

DATA(sent_to_all) = send_request->send( i_with_error_screen = 'X' ).
COMMIT WORK.

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """

    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ],
        columns=["userId", "predictions", "feature1", "feature2"]
    )

Merging streams when _any_ of the substreams has a value ready

From the Akka Streams documentation, it looks like all the stream merging options (merge, mergeSorted, mergePreferred, zipN, zipWithN) work by waiting until all merged streams have a new element ready, then applying the merge strategy (combining the elements into a tuple, applying a zip function, etc.).
This works well for offline processing (e.g. reading the data from files or HTTP and combining it), but it introduces latency in online processing. I need to merge streams of data produced by e.g. multiple WebSocket connections, and deliver updates in the merged stream as soon as any of the source streams produces a value. Example: if there are source streams A and B, here's what should appear in the merged stream:
Output stream starts with some initial value, e.g. (None, None).
(A:1) (B:<not ready>) -> (Some(1), None)
(A:2) (B:<not ready>) -> (Some(2), None)
(A:3) (B:1) -> (Some(3), Some(1))
(A:3) (B:2) -> (Some(3), Some(2))
etc. Again, a new value appears in the output stream immediately, whenever any of the source streams produces a value.
Is there any combinator to achieve that?
As stated in the comments, Merge and MergePreferred stages do emit elements downstream even if not all upstreams have an element available.
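For completeness: if interleaving (rather than pairing) of elements is acceptable, a plain merge already emits each element downstream as soon as either upstream produces one. A minimal sketch, using the src1 and src2 declared further below:
val merged : Source[Int, NotUsed] = src1 merge src2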
From your example it looks like you are looking for zipping sources though. And yes, Zip emits the zipped tuple downstream only when it has elements to zip from all its upstreams. To overcome this you can 'lift' your sources to produce Options, and make them emit None whenever there is nothing else to emit. The source wrapper can look like this:
def asOption[In, Mat](source: Source[In, Mat]): Source[Option[In], Mat] =
  Source.fromGraph(GraphDSL.create(source.map(Option(_))) {
    implicit builder: GraphDSL.Builder[Mat] => src =>
      import GraphDSL.Implicits._

      val noneSource = Source.repeat(None)
      val merge = builder.add(MergePreferred[Option[In]](1))

      src        ~> merge.preferred
      noneSource ~> merge.in(0)

      SourceShape(merge.out)
  })
At this point you can zip your sources as you would normally.
val src1: Source[Int, NotUsed] = ???
val src2: Source[Int, NotUsed] = ???
val zipped = asOption(src1) zip asOption(src2)
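If you also want the "latest value on each side" pairs from the question's example (e.g. (Some(3), Some(1)) followed by (Some(3), Some(2))), one possible follow-up, offered as a sketch rather than as part of the answer above, is to fold the zipped stream with scan so each side remembers its last defined value; scan also emits its zero first, which gives you the initial (None, None):
val latest : Source[(Option[Int], Option[Int]), NotUsed] =
  zipped.scan((Option.empty[Int], Option.empty[Int])) {
    case ((lastA, lastB), (a, b)) => (a orElse lastA, b orElse lastB)
  }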

Merge results of ExecuteSQL processor with Json content in nifi 6.0

I am dealing with json objects containing geo coordinate points. I would like to run these points against a postgis server I have locally to assess point in polygon matching.
I'm hoping to do this with preexisting processors - I am successfully extracting the lat/lon coordinates into attributes with an "EvaluateJsonPath" processor, and successfully issuing queries to my local postgis datastore with "ExecuteSQL". This leaves me with avro responses, which I can then convert to JSON with the "ConvertAvroToJSON" processor.
I'm having conceptual trouble with how to merge the results of the query back together with the original JSON object. As it is, I've got two flow files with the same fragment ID, which I could theoretically merge together with "mergecontent", but that gets me:
{"my":"original json", "coordinates":[47.38, 179.22]}{"polygon_match":"a123"}
Are there any suggested strategies for merging the results of the SQL query into the original json structure, so my result would be something like this instead:
{"my":"original json", "coordinates":[47.38, 179.22], "polygon_match":"a123"}
I am running nifi 6.0, postgres 9.5.2, and postgis 2.2.1.
I saw some reference to using the ReplaceText processor in https://community.hortonworks.com/questions/22090/issue-merging-content-in-nifi.html - but this seems to be merging content from an attribute into the body of the content. What I'm missing is how to merge the content of the original flow file with either the content of the SQL response, or with attributes extracted from the SQL response without its content.
Edit:
The following Groovy script appears to do what is needed. I am not a Groovy coder, so any improvements are welcome.
import org.apache.commons.io.IOUtils
import java.nio.charset.*
import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.StreamCallback

def flowFile = session.get()
if (flowFile == null) {
    return
}

def slurper = new JsonSlurper()

flowFile = session.write(flowFile,
    { inputStream, outputStream ->
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def obj = slurper.parseText(text)
        def originaljsontext = flowFile.getAttribute('original.json')
        def originaljson = slurper.parseText(originaljsontext)
        originaljson.put("point_polygon_info", obj)
        outputStream.write(groovy.json.JsonOutput.toJson(originaljson).getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)

session.transfer(flowFile, ExecuteScript.REL_SUCCESS)
If your original JSON is relatively small, a possible approach might be the following...
Use ExtractText before getting to ExecuteSQL to copy the original JSON into an attribute.
After ExecuteSQL, and after ConvertAvroToJSON, use an ExecuteScript processor to create a new JSON document that combines the original from the attribute with the results in the content.
I'm not exactly sure what needs to be done in the script, but I know others have had success using Groovy and JsonSlurper through the ExecuteScript processor.
http://groovy-lang.org/json.html
http://docs.groovy-lang.org/latest/html/gapi/groovy/json/JsonSlurper.html
