Merging streams when _any_ of the substreams has a value ready - akka-stream

From the Akka Streams documentation, it looks like all stream merging options (merge, mergeSorted, mergePreferred, zipN, zipWithN) work by waiting until all merged streams have a new element ready, then applying the merge strategy (combining elements into a tuple, applying a zip function, etc.).
This works well for offline processing (e.g. reading data from files or HTTP and combining it), but it introduces latency in online processing. I need to merge streams of data produced by e.g. multiple WebSocket connections, and deliver updates in the merged stream as soon as any of the source streams produces a value. Example: if there are source streams A and B, here's what should be in the merged stream:
Output stream starts with some initial value, e.g. (None, None).
(A:1) (B:<not ready>) -> (Some(1), None)
(A:2) (B:<not ready>) -> (Some(2), None)
(A:3) (B:1) -> (Some(3), Some(1))
(A:3) (B:2) -> (Some(3), Some(2))
etc. Again, a new value should appear in the output stream immediately whenever any of the source streams produces a value.
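To pin down the semantics being asked for, here is a minimal plain-Python model of the desired output (not Akka code). The function name `combine_latest` and the tagged-event representation are assumptions made for illustration. It also assumes events arrive one at a time, so the "(A:3) (B:1)" row of the example shows up as two consecutive steps.

```python
from typing import Iterable, Iterator, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def combine_latest(events: Iterable[Tuple[str, int]]) -> Iterator[Pair]:
    """Yield the latest (A, B) pair after every event from either source."""
    latest = {"A": None, "B": None}
    # Initial value, emitted before any source has produced anything.
    yield (latest["A"], latest["B"])
    for source, value in events:
        latest[source] = value
        # Emit immediately; do not wait for the other source.
        yield (latest["A"], latest["B"])


# Events tagged with the source that produced them, in arrival order.
events = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)]
result = list(combine_latest(events))
# result == [(None, None), (1, None), (2, None), (3, None), (3, 1), (3, 2)]
```

None plays the role of Scala's None here, and a plain value plays the role of Some(value).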
Is there any combinator to achieve that?

As stated in the comments, Merge and MergePreferred stages do emit elements downstream even if not all upstreams have an element available.
From your example it looks like you are looking for zipping sources though. And yes, Zip emits the zipped tuple downstream only when it has elements to zip from all its upstreams. To overcome this you can 'lift' your sources to produce Options, and make them emit None whenever there is nothing else to emit. The source wrapper can look like this:
def asOption[In, Mat](source: Source[In, Mat]): Source[Option[In], Mat] =
  Source.fromGraph(GraphDSL.create(source.map(Option(_))) {
    implicit builder: GraphDSL.Builder[Mat] => src =>
      import GraphDSL.Implicits._
      val noneSource = Source.repeat(None)
      val merge = builder.add(MergePreferred[Option[In]](1))
      src ~> merge.preferred
      noneSource ~> merge.in(0)
      SourceShape(merge.out)
  })
At this point you can zip your sources as you would normally.
val src1: Source[Int, NotUsed] = ???
val src2: Source[Int, NotUsed] = ???
val zipped = asOption(src1) zip asOption(src2)

Related

How to export Snowflake Web UI Worksheet SQL to file

The Classic Snowflake Web UI and the new Snowsight are great at importing SQL from a file, but neither allows you to export SQL to a file. Is there a workaround?
You can use an IDE to connect to Snowflake and write queries. The scripts can then be saved using the IDE's features and synced with a git repo as well.
DBeaver is one such IDE that supports Snowflake:
https://hevodata.com/learn/dbeaver-snowflake/
The query pane is interactive so the obvious workaround will be:
CTRL + A (select all)
CTRL + C (copy)
<open_favourite_text_editor>
CTRL + V (paste)
CTRL + S (save)
This tool can help you while the team develops a native feature to export worksheets:
"Snowflake Snowsight Extensions wrap Snowsight features that do not have API or SQL alternatives, such as manipulating Dashboards and Worksheets, and retrieving Query Profile and step timings."
https://github.com/Snowflake-Labs/sfsnowsightextensions
Further explained on this post:
https://medium.com/snowflake/importing-and-exporting-snowsight-dashboards-and-worksheets-3cd8e34d29c8
For example, to save to a file within PowerShell:
PS > $dashboards | foreach {$_.SaveToFolder("path/to/folder")}
PS > $dashboards[0].SaveToFile("path/to/folder/mydashboard.json")
ETA: I'm adding this edit to the front because this is what actually worked.
Again, BSON was a dead end & punycode is irrelevant. I don't know why punycode is referenced in the metadata file; but my best guess is that they might use punycode to encode the worksheet name itself (though I'm not sure why that would be needed since it shouldn't need to be part of a URL).
After doing terrible things and trying a number of complex ways of dealing with escape character hell, I found that the actual encoding is very simple. It just works as an 8 bit encoding with anything that might cause problems escaped away (null, control codes, double quotes, etc.). To load, treat the file as a text file using an 8-bit encoding; extract the data as a JSON field, then re-encode that extracted data as that same encoding. I just used latin_1 to read; but it may not even matter which encoding you use as long as you are consistent and use the same one to re-encode. The encoded field will then be valid zlib compressed data.
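As a minimal sketch of the decoding steps just described, the following builds a stand-in worksheet file in memory and decodes it the same way. The sample content and the helper name `decode_worksheet_body` are assumptions for illustration; only the "wsContents", "encoding" and "body" keys come from the files themselves.

```python
import json
import zlib


def decode_worksheet_body(raw_json_text: str) -> str:
    """Recover the SQL text from a worksheet file read as latin_1 text."""
    doc = json.loads(raw_json_text)
    body = doc["wsContents"]["body"]      # str: one character per original byte
    compressed = body.encode("latin_1")   # re-encode with the same 8-bit codec
    return zlib.decompress(compressed).decode("utf8")


# Hypothetical stand-in for a downloaded file (not real Snowflake output):
# compress some SQL, map each byte to one latin_1 character, store it as JSON.
sql = "select 1; -- caf\u00e9"
fake_body = zlib.compress(sql.encode("utf8")).decode("latin_1")
fake_file_text = json.dumps({"wsContents": {"encoding": "gzip", "body": fake_body}})

recovered = decode_worksheet_body(fake_file_text)
```

With a real file, the text would come from open(path, encoding='latin_1').read() instead of the stand-in string.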
I decided that I wanted to start from scratch, so I needed to back up the worksheets first, and I wrote a Python script based on my findings above. Be warned that this may return even worksheets that you previously closed for good. After running this and verifying that backups were created, I just ran rm @~/worksheet_data/;, closed the tab & reopened it.
Here's the code (fill in the appropriate base directory location):
import os
from collections import OrderedDict
import configparser
from sqlalchemy import create_engine, exc
from snowflake.sqlalchemy import URL
import pathlib
import json
import zlib
import string
def format_filename(s: str) -> str:  # From https://gist.github.com/seanh/93666
    """Take a string and return a valid filename constructed from the string.
    Uses a whitelist approach: any characters not present in valid_chars are
    removed. Also spaces are replaced with underscores.
    Note: this method may produce invalid filenames such as ``, `.` or `..`
    When I use this method I prepend a date string like '2009_01_15_19_46_32_'
    and append a file extension like '.txt', so I avoid the potential of using
    an invalid filename.
    """
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    # filename = filename.replace(' ','_') # I don't like spaces in filenames.
    return filename
def trlng_dash(s: str) -> str:
    """Removes a trailing dash if present (safe on empty strings)."""
    return s[:-1] if s.endswith('-') else s
sso_authenticate = True
# Assumes CLI config file exists.
config = configparser.ConfigParser()
home = pathlib.Path.home()
config_loc = home/'.snowsql/config' # Assumes it's set up from Snowflake CLI.
base_dir = home/r'{your Desired base directory goes here.}'
json_dir = base_dir/'json' # Location for your worksheet stage JSON files.
sql_dir = base_dir/'sql' # Location for your worksheets.
# Assumes CLI config file exists.
config.read(config_loc)
# Add connection parameters here (assumes CLI config exists).
# Using sso so only 2 are needed.
# If there's no config file, etc. enter by hand here (or however you want to do it).
connection_params = {
    'account': config['connections']['accountname'],
    'user': config['connections']['username'],
}
if sso_authenticate:
    connection_params['authenticator'] = 'externalbrowser'
if config['connections'].get('password', None) is not None:
    connection_params['password'] = config['connections']['password']
if config['connections'].get('rolename', None) is not None:
    connection_params['role'] = config['connections']['rolename']
if locals().get('database', None) is not None:
    connection_params['database'] = database
if locals().get('schema', None) is not None:
    connection_params['schema'] = schema
sf_engine = create_engine(URL(**connection_params))
if not base_dir.exists():
    base_dir.mkdir()
if not json_dir.exists():
    json_dir.mkdir()
if not (sql_dir).exists():
    sql_dir.mkdir()
with sf_engine.connect() as connection:
    # The user stage is addressed as @~ in Snowflake.
    connection.execute(f'get @~/worksheet_data/ \'file://{str(json_dir.as_posix())}\';')
for file in [path for path in json_dir.glob('*') if path.is_file()]:
    if file.suffix != '.json':
        file.replace(file.with_suffix(file.suffix + '.json'))
with open(json_dir/'metadata.json', 'r') as metadata_file:
    files_meta = json.load(metadata_file)
# List of files from metadata file will contain some empty worksheets.
files_description_orig = OrderedDict(
    (file_key_value['name'], file_key_value)
    for file_key_value in sorted(
        files_meta['activeWorksheets'] + list(files_meta['inactiveWorksheets'].values()),
        key=lambda x: x['name'])
    if file_key_value['name'])
# files_description will only track non empty worksheets
files_description = files_description_orig.copy()
# Create updated files description filtering out empty worksheets.
for item in files_description_orig:
    json_file = json_dir/f"{files_description_orig[item]['name']}.json"
    # If a file didn't make it or was deleted by hand, we should
    # remove from the filtered description & continue to the next item.
    if not (json_file.exists() and json_file.is_file()):
        del files_description[item]
        continue
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
        # If the file represents a worksheet with a body field, we want it.
        if not json_dat['wsContents'].get('body'):
            del files_description[item]
            ## Delete JSON files corresponding to empty worksheets.
            # f.close()
            # try:
            #     (json_dir/f"{files_description_orig[item]['name']}.json").unlink()
            # except:
            #     pass
# Produce a list of normalized filenames (no illegal or awkward characters).
file_names = set(
    format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    for item in files_description)
# Add useful information to our files_description OrderedDict
for file_name in file_names:
    repeats_cnt = 0
    file_name_repeats = (
        item
        for item
        in files_description
        if file_name == format_filename(trlng_dash(files_description[item]['encodedDetails']['scriptName']).strip())
    )
    for file_uuid in file_name_repeats:
        files_description[file_uuid]['normalizedName'] = file_name
        files_description[file_uuid]['stemSuffix'] = '' if repeats_cnt == 0 else f'({repeats_cnt:0>2})'
        repeats_cnt += 1
# Now we iterate on non-empty worksheets only.
for item in files_description:
    json_file = json_dir/f"{files_description[item]['name']}.json"
    with open(json_file, 'r', encoding='latin_1') as f:
        json_dat = json.load(f)
    body = json_dat['wsContents']['body']
    # Re-encode with the same 8-bit codec used for reading, then decompress.
    body_bin = body.encode('latin_1')
    body_txt = zlib.decompress(body_bin).decode('utf8')
    sql_file = sql_dir/f"{files_description[item]['normalizedName']}{files_description[item]['stemSuffix']}.sql"
    # Write as UTF-8 explicitly so non-ASCII SQL text survives on any platform.
    with open(sql_file, 'w', encoding='utf8') as sql_f:
        sql_f.write(body_txt)
    # 'created' is in milliseconds; os.utime expects seconds.
    creation_stamp = files_description[item]['created']/1000
    os.utime(sql_file, (creation_stamp,creation_stamp))
print('Done!')
As mentioned at 'Is there any option in snowflake to save or load worksheets?' (and in Snowflake's own documentation), in the Classic UI, the worksheets are saved in the user stage under @~/worksheet_data/.
You can download it with a get command like:
get @~/worksheet_data/<name> file:///<your local location>; (though you might need quoting if running from Windows).
The problem is that I do not know how to access it programmatically. The downloaded files look like JSON but it is not valid JSON. The main key is "wsContents" and contains most of the worksheet information. Its value includes two subkeys, "encoding" and "body".
The "encoding" key denotes that gzip is being used. The "body" key seems to be the actual worksheet data which looks a lot like a straight binary representation of the compressed text data. As such, any JSON reader will choke on it.
If it is anything like that, I do not currently know how to access it programmatically using Python.
I do see that a JSON like format exists, BSON, that is bundled into PyMongo. Trying to use this on these files fails. I even tried bson.is_valid and it returns False so I am assuming that it means that these files in Snowflake are not actually BSON.
Edited to add: Again, BSON is a dead end.
Examining the "body" value as just binary data, the first two bytes of sample files do seem to correspond to default zlib compression (0x789c). However, attempting to run straight zlib.decompress on the slice created from that first byte to the last corresponding to the first & last characters of the "body" value results in the error:
Error - 3 while decompressing data: invalid code lengths set
This makes me think that the bytes there, as is, are at least partly garbage and still need some processing before they can be decompressed.
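The header observation can be checked mechanically. The sketch below tests the two-byte zlib header as defined in RFC 1950 and shows that a correct header alone does not make the rest of the stream decodable. The helper name `has_zlib_header` and the deliberately damaged sample are assumptions for illustration, not taken from the Snowflake files.

```python
import zlib


def has_zlib_header(data: bytes) -> bool:
    """True if the first two bytes form a valid zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble of CMF must be 8 (deflate); CMF*256 + FLG must be divisible by 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


good = zlib.compress(b"select 1;")
# Simulate damaged data: keep the header, overwrite everything after it.
damaged = good[:2] + b"\xff" * (len(good) - 2)

try:
    zlib.decompress(damaged)
    damaged_ok = True
except zlib.error:
    damaged_ok = False
```

Both `good` and `damaged` pass the header check, but only `good` decompresses, which matches the symptom above: a plausible header followed by bytes that still need processing.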
One clue that I failed to mention earlier is that the metadata file (called "metadata", which serves as an inventory of the remaining files at the @~/worksheet_data/ location) declares that the files use the punycode encoding. However, I do not know how to use that information. The data in these files doesn't particularly look like what I feel punycode should look like, nor does it make much sense to me to use punycode on binary data, such as zlib-compressed data, that is never meant to be used directly as text.

Akka Streams: how to handle a Source that itself produces multiple Sources

Hi, I have a situation where there is a Source which itself produces Sources. The number of sources that will be produced is not known in advance. Is there a proper design pattern to handle this case? Basically it would look like Source ----->Multiple Sources ------->Sink
EDIT
The scenario for this is as follows.
Create a Source out of a database iterator
For each database file provided by the above source, transform the file into a Source
Attach those dynamically created sources to a file IO sink
Basically I want a bunch of database content to be written to separate files via streams with backpressure
Given a Source of Sources:
type Data = ???
val sos : Source[Source[Data, _], _] = ???
Each of the Data Sources can be drained into individual File Sinks using Source.runForeach.
We first need a function that can generate the Path that you want the data written to:
val pathCreator : () => Path = ???
And a way of converting Data to ByteString:
val dataToByteString : Data => ByteString = ???
These functions can finally be combined to get the behavior you're looking for:
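// Requires an implicit Materializer (or ActorSystem) in scope.
// runWith keeps the sink's materialized value, i.e. the Future[IOResult].
val drainSourceToFile : Source[Data, _] => Future[IOResult] =
  _.map(dataToByteString)
    .runWith(FileIO.toPath(pathCreator()))
sos.runForeach(source => drainSourceToFile(source))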
If you want all of the IOResult values from FileIO.toPath so that you can know whether the writing was successful then you'll need a slightly more complicated setup:
// Requires an implicit Materializer and ExecutionContext in scope.
val allIOResults : Future[Seq[IOResult]] =
  sos.map(drainSourceToFile)
    .runWith(Sink.seq)
    .flatMap(results => Future.sequence(results))

Connecting SourceShape/PortOpts from a UniformFanOutShape to sources in Akka Streams

Most examples of FanOut shapes in Akka Streams either merge the fanned-out flows into a single stream, or immediately connect them to a sink. I want to instead return a sequence of Sources to which I can then apply transformations and connect sinks later on. An example would look something like this:
def transformedSources(src: Source[Int, NotUsed]): Seq[Source[Int, NotUsed]] = {
  import GraphDSL.Implicits._
  val builder = new GraphDSL.Builder
  val bcast = builder.add(Broadcast[Int](2))
  src.out ~> bcast.in
  val sourceShapes = bcast.outlets.map { out => SourceShape(out) }
  ??? // How to convert sourceShapes into Sources
}
Is there a way to achieve this?
P.S.
My real use case is an AmorphousShape that takes X sources and produces X outputs. Ideally I would like to apply the AmorphousShape stage, and then continue operating as if nothing had changed.
So if my code now is
sources.map{s => s.via(someStage).runWith(Sink.seq)}
I would like to transform it into
transformSource(sources).map{s => s.via(someStage).runWith(Sink.seq)}
def transformSource(sources:Seq[Source[Int, NotUsed]]) : Seq[Source[Int, NotUsed]] = {do some magic with an AmorphousShape graph }

Flink CEP is not deterministic

I have the following code running locally without a cluster:
val count = new AtomicInteger()
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val text: DataStream[String] = env.readTextFile("file:///flink/data2")
val mapped: DataStream[Map[String, Any]] = text.map((x: String) => Map("user" -> x.split(",")(0), "val" -> x.split(",")(1)))
val pattern: ...
CEP.pattern(mapped, pattern).select(eventMap => {
  println("Found: " + (patternName, eventMap))
  count.incrementAndGet()
})
env.execute()
println(count)
My data is a CSV file in the following format (user, val):
1,1
1,2
1,3
2,1
2,2
2,3
...
I am trying to detect events of the pattern where event(val=1) -> event(val=2) -> event(val=3). When I run this on a large input stream, with a set number of events that I know exist in the stream, I get an inconsistent count of events detected, almost always less than the number of events in the system. If I do env.setParallelism(1) (Like I have done in line 3 of the code), all events are detected.
I assume the problem is that multiple threads are processing the events from the stream when the parallelism is > 1, which means that while one thread has event(val=1) -> event(val=2), event(val=3) might be sent to a different thread and the whole pattern might not get detected.
Is there something I'm missing here? I cannot lose any patterns in the stream, but setting parallelism to 1 seems to defeat the purpose of having a system like Flink to detect events.
Update:
I have tried keying the stream using:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
Though this prevents events of different users interfering with each other:
1,1
2,2
1,3
This does not prevent Flink from sending the events to the node out of order, which means that the non-determinism still exists.
Most probably the problem lies in applying the keyBy operator after the map operator.
So, instead of:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
There should be:
val mapped: KeyedStream[Map[String, Any]] = text.keyBy((m) => m.get("user")).map(...)
I know this is an old question, but maybe it helps someone.
Have you thought about keying your stream with the userid (your first value)? Flink guarantees that all events of one key get to the same processing node.
Of course that only helps, if you want to detect a pattern of val=1->val=2->val=3 per user.
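To make the per-key reasoning concrete, here is a small plain-Python model (not Flink code, and not how Flink CEP is implemented) of detecting val=1 -> val=2 -> val=3 separately for each user. The function name `count_matches_per_key` and the strict-sequence matching rule are assumptions for illustration. It shows that keying keeps interleaved users from interfering with each other, but does not help if a single user's events arrive out of order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PATTERN = [1, 2, 3]


def count_matches_per_key(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Count strict 1 -> 2 -> 3 sequences per user, in arrival order."""
    progress: Dict[str, int] = defaultdict(int)  # next index into PATTERN, per user
    matches: Dict[str, int] = defaultdict(int)
    for user, val in events:
        if val == PATTERN[progress[user]]:
            progress[user] += 1
            if progress[user] == len(PATTERN):
                matches[user] += 1
                progress[user] = 0
        else:
            # Restart; the current event may itself begin a new match.
            progress[user] = 1 if val == PATTERN[0] else 0
    return dict(matches)


# Two users interleaved: per-key state keeps their sequences apart.
interleaved = [("1", 1), ("2", 1), ("1", 2), ("2", 2), ("1", 3), ("2", 3)]
interleaved_counts = count_matches_per_key(interleaved)

# One user, events delivered out of order: the pattern is missed.
out_of_order_counts = count_matches_per_key([("1", 1), ("1", 3), ("1", 2)])
```

In this model `interleaved_counts` is {"1": 1, "2": 1} while `out_of_order_counts` is empty, which mirrors the two separate concerns in the question: partitioning by key and ordering within a key.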

Merge results of ExecuteSQL processor with Json content in nifi 6.0

I am dealing with json objects containing geo coordinate points. I would like to run these points against a postgis server I have locally to assess point in polygon matching.
I'm hoping to do this with preexisting processors - I am successfully extracting the lat/lon coordinates into attributes with an "EvaluateJsonPath" processor, and successfully issuing queries to my local postgis datastore with "ExecuteSQL". This leaves me with avro responses, which I can then convert to JSON with the "ConvertAvroToJSON" processor.
I'm having conceptual trouble with how to merge the results of the query back together with the original JSON object. As it is, I've got two flow files with the same fragment ID, which I could theoretically merge together with "mergecontent", but that gets me:
{"my":"original json", "coordinates":[47.38, 179.22]}{"polygon_match":"a123"}
Are there any suggested strategies for merging the results of the SQL query into the original json structure, so my result would be something like this instead:
{"my":"original json", "coordinates":[47.38, 179.22], "polygon_match":"a123"}
I am running nifi 6.0, postgres 9.5.2, and postgis 2.2.1.
I saw some reference to using replaceText processor in https://community.hortonworks.com/questions/22090/issue-merging-content-in-nifi.html - but this seems to be merging content from an attribute into the body of the content. I'm missing the point of merging the content of the original and either the content of the SQL response, or attributes extracted from the SQL response without the content.
Edit:
Groovy script following appears to do what is needed. I am not a groovy coder, so any improvements are welcome.
import org.apache.commons.io.IOUtils
import java.nio.charset.*
import groovy.json.JsonSlurper
def flowFile = session.get();
if (flowFile == null) {
    return;
}
def slurper = new JsonSlurper()
flowFile = session.write(flowFile,
    { inputStream, outputStream ->
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def obj = slurper.parseText(text)
        def originaljsontext = flowFile.getAttribute('original.json')
        def originaljson = slurper.parseText(originaljsontext)
        originaljson.put("point_polygon_info", obj)
        outputStream.write(groovy.json.JsonOutput.toJson(originaljson).getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)
If your original JSON is relatively small, a possible approach might be the following...
Use ExtractText before getting to ExecuteSQL to copy the original JSON into an attribute.
After ExecuteSQL, and after ConvertAvroToJSON, use an ExecuteScript processor to create a new JSON document that combines the original from the attribute with the results in the content.
I'm not exactly sure what needs to be done in the script, but I know others have had success using Groovy and JsonSlurper through the ExecuteScript processor.
http://groovy-lang.org/json.html
http://docs.groovy-lang.org/latest/html/gapi/groovy/json/JsonSlurper.html
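The combining step itself is small. As a language-neutral sketch in plain Python (not a NiFi script, and using no NiFi API), assuming the original document is available as a string from the attribute and the converted SQL result is the flow file content; the function name `merge_query_result` is hypothetical:

```python
import json


def merge_query_result(original_json: str, result_json: str) -> str:
    """Merge the SQL result's top-level keys into the original JSON object."""
    original = json.loads(original_json)
    result = json.loads(result_json)
    # Keys from the query result overwrite same-named keys in the original.
    original.update(result)
    return json.dumps(original)


original_text = '{"my": "original json", "coordinates": [47.38, 179.22]}'
result_text = '{"polygon_match": "a123"}'
merged = merge_query_result(original_text, result_text)
```

This produces the flat structure asked for in the question. Nesting the result under a single key instead would be `original["point_polygon_info"] = result`.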
