Merge results of ExecuteSQL processor with JSON content in NiFi 0.6.0 - PostGIS

I am dealing with JSON objects containing geo coordinate points. I would like to run these points against a PostGIS server I have locally to assess point-in-polygon matching.
I'm hoping to do this with pre-existing processors: I am successfully extracting the lat/lon coordinates into attributes with an "EvaluateJsonPath" processor, and successfully issuing queries to my local PostGIS datastore with "ExecuteSQL". This leaves me with Avro responses, which I can then convert to JSON with the "ConvertAvroToJSON" processor.
I'm having conceptual trouble with how to merge the results of the query back together with the original JSON object. As it is, I've got two flow files with the same fragment ID, which I could theoretically merge together with "MergeContent", but that gets me:
{"my":"original json", "coordinates":[47.38, 179.22]}{"polygon_match":"a123"}
Are there any suggested strategies for merging the results of the SQL query into the original json structure, so my result would be something like this instead:
{"my":"original json", "coordinates":[47.38, 179.22], "polygon_match":"a123"}
I am running NiFi 0.6.0, Postgres 9.5.2, and PostGIS 2.2.1.
I saw some reference to using the ReplaceText processor in https://community.hortonworks.com/questions/22090/issue-merging-content-in-nifi.html - but that seems to be about merging content from an attribute into the body of the content. I'm still missing how to merge the content of the original flow file with either the content of the SQL response, or with attributes extracted from the SQL response.
Edit:
The following Groovy script appears to do what is needed. I am not a Groovy coder, so any improvements are welcome.
import org.apache.commons.io.IOUtils
import java.nio.charset.*
import groovy.json.JsonSlurper

def flowFile = session.get()
if (flowFile == null) {
    return
}
def slurper = new JsonSlurper()
flowFile = session.write(flowFile,
    { inputStream, outputStream ->
        // the flow file content is the SQL result (already converted from Avro to JSON)
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def obj = slurper.parseText(text)
        // the original JSON was stashed in the 'original.json' attribute upstream
        def originaljsontext = flowFile.getAttribute('original.json')
        def originaljson = slurper.parseText(originaljsontext)
        // attach the query result to the original JSON and write it back as the new content
        originaljson.put("point_polygon_info", obj)
        outputStream.write(groovy.json.JsonOutput.toJson(originaljson).getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)

If your original JSON is relatively small, a possible approach might be the following...
Use ExtractText before getting to ExecuteSQL to copy the original JSON into an attribute.
After ExecuteSQL, and after ConvertAvroToJSON, use an ExecuteScript processor to create a new JSON document that combines the original from the attribute with the results in the content.
I'm not exactly sure what needs to be done in the script, but I know others have had success using Groovy and JsonSlurper through the ExecuteScript processor.
http://groovy-lang.org/json.html
http://docs.groovy-lang.org/latest/html/gapi/groovy/json/JsonSlurper.html

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dicts into a Pandas DataFrame, or store them as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is TensorFlow's PredictionLog.
This must be a common use case, because TensorFlow Model Analysis functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery) or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or to whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
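For instance, a minimal sketch (not from the original answer) of dumping the parsed records to a CSV with pandas, assuming every feature in spec is a dense FixedLenFeature and using a made-up output file name:
import pandas as pd

# `dataset` is the parsed tf.data.Dataset from the snippet above
rows = []
for example in dataset:
    # each value is a dense Tensor under the FixedLenFeature assumption;
    # sparse features would need extra handling
    rows.append({name: tensor.numpy() for name, tensor in example.items()})

df = pd.DataFrame(rows)
df.to_csv('bulk_inference_results.csv', index=False)  # illustrative file name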
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )
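For what it's worth, a usage sketch (the glob path is hypothetical; point it at your BulkInferrer inference_result artifact's URI):
import tensorflow as tf

# hypothetical artifact location; substitute your pipeline's output directory
prediction_files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(prediction_files)
print(df.head())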

JSON array KeyError in Python

I'm fairly new to Python programming and am attempting to extract data from a JSON array. The code below results in an error at
js[jstring][jkeys]['5. volume'])
Any help would be much appreciated.
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    url = https://www.alphavantage.co/queryfunction=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    try:
        js = json.loads(data)
    except:
        js = None
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

print('volume', DailyData(symbol)[5])
Looks like the reason for the error is that the data returned from the URL is a bit more hierarchical than you may realize. To see that, print out js (I recommend using a Jupyter notebook):
import urllib.request, urllib.parse, urllib.error
import ssl
import json
import sqlite3
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
stockdata = urllib.request.urlopen(url)
data = stockdata.read().decode()
js = json.loads(data)
js
You can see that js (now a Python dict) has a "Meta Data" key before the actual time series begins. You need to start operating on the dict at the 'Time Series (Daily)' key.
Having said that, to get the data into a table-like structure (for plotting, time series analysis, etc.), you can use the pandas package to read that dict key directly into a DataFrame. The pandas DataFrame constructor accepts a dict as input. In this case, the data arrives transposed, so the .T at the end rotates it (try with and without the .T and you will see it).
import pandas as pd
df=pd.DataFrame(js['Time Series (Daily)']).T
df
Added edit... You could get the data into a dataframe with a single line of code:
import requests
import pandas as pd
url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo"
data = pd.DataFrame(requests.get(url).json()['Time Series (Daily)']).T
DataFrame: the constructor from pandas that makes the data into a table-like structure
requests.get(): method from the requests library to fetch data.
.json(): directly converts from JSON to a dict
['Time Series (Daily)']: pulls out the key from the dict that is the time series
.T: transposes the rows and columns.
Good luck!
The following code worked for me:
import urllib.request, urllib.parse, urllib.error
import json

def DailyData(symb):
    # Your code was missing the ? after query
    url = "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={}&apikey=demo".format(symb)
    stockdata = urllib.request.urlopen(url)
    data = stockdata.read().decode()
    js = json.loads(data)
    jstring = 'Time Series (Daily)'
    for entry in js:
        i = js[jstring].keys()
        for jkeys in i:
            return (jkeys,
                    js[jstring][jkeys]['1. open'],
                    js[jstring][jkeys]['2. high'],
                    js[jstring][jkeys]['3. low'],
                    js[jstring][jkeys]['4. close'],
                    js[jstring][jkeys]['5. volume'])

# query multiple times, just to print one item?
print('open', DailyData('MSFT')[1])
print('high', DailyData('MSFT')[2])
print('low', DailyData('MSFT')[3])
print('close', DailyData('MSFT')[4])
print('volume', DailyData('MSFT')[5])
Output:
open 99.8850
high 101.4300
low 99.6700
close 101.1600
volume 19234627
Without seeing the error, it's hard to know what exact problem you were having.
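As a general debugging aid, here is a minimal sketch (not taken from either answer) that guards the lookup so an unexpected response is printed instead of raising a bare KeyError:
jstring = 'Time Series (Daily)'
if js is None or jstring not in js:
    # the API returned something unexpected (bad URL, rate limit, error payload, ...)
    print('Unexpected response:', js)
else:
    daily = js[jstring]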

Finatra FeatureTests: How to manually deserialize returned JSON

I read the Finatra getting started guide and I was able to write the HelloWorld Service and its feature test.
Currently my feature test looks like
server.httpPost(
  path = "/hi",
  postBody = """{"name": "Foo", "dob": 136190040000}""",
  andExpect = Ok,
  withBody = """{"msg":"Hello Foo. You are 15780 days old today"}""")
This works fine and my tests pass. However, my requirement is to extract the JSON returned by the server and then manually perform assertions on the returned object.
I changed my code to
val response = server.httpPost(
  path = "/hi",
  postBody = """{"name": "Abhishek", "dob": 136190040000}""",
  andExpect = Ok,
  withBody = """{"msg":"Hello Abhishek. You are 15780 days old today"}""")
val json = response.contentString
This also works, and I can see the returned JSON inside the variable json.
My question is: if I have to deserialize this JSON into an object, should I just pull in a JSON library like circe and deserialize it with that?
Or can I use the Jackson framework that comes inside Finatra?
In all the examples I could find, Finatra "automatically" handles the JSON serialization and deserialization. But in my case I want to perform this manually.
You can use the FinatraObjectMapper by calling (using your example) server.mapper. That wraps a Jackson ObjectMapper that you could use if you wanted to use the Jackson library without any of the Finatra add-ons.
Or you can pull in a different JSON library. If you are using SBT, you can restrict libraries to certain areas of your code, so if you wanted to use circe only in the test code, you could scope the dependency to "test" in your build.sbt, the same way scalatest is here:
"org.scalatest" %% "scalatest" % "2.2.6" % "test"

How to get an array from an XML file in Swift

I am new to Swift, but I have made an Android app where a string array is selected from an XML file. This is a large XML file that contains a lot of string arrays, and the app gets the relevant string array based on a user selection.
I am now trying to develop the same app for iOS using Swift. I would like to use the same XML file, but I cannot see an easy way to get the correct array. For example, part of the XML looks like this:
<string-array name="OCR_Businessstudies_A_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
<item>7. The global environment of business</item>
</string-array>
<string-array name="OCR_Businessstudies_AS_Topics">
<item>1. Business objectives and strategic decisions</item>
<item>2. External influences facing businesses</item>
<item>3. Marketing and marketing strategies</item>
<item>4. Operational strategy</item>
<item>5. Human resources</item>
<item>6. Accounting and financial considerations</item>
</string-array>
If I have the string "OCR_Businessstudies_A_Topics", how do I get the "OCR_Businessstudies_A_Topics" array from the XML file?
This is very straightforward in Android, and although I have followed online tutorials for Swift, it seems like I have to parse the XML file and I do not seem to be getting anywhere.
Is there a better approach than trying to parse the whole XML file?
Thanks
Barry
You can write your own XML parser using NSXMLParser (implementing NSXMLParserDelegate), or use a library like HTMLReader:
let fileURL = NSBundle.mainBundle().URLForResource("data", withExtension: "xml")!
let xmlData = NSData(contentsOfURL: fileURL)!
let topic = "OCR_Businessstudies_A_Topics"
let document = HTMLDocument(data: xmlData, contentTypeHeader: "text/xml")
for item in document.nodesMatchingSelector("string-array[name='\(topic)'] item") {
    print(item.textContent)
}

Setting array equal to JSON array - Xcode

I'm trying to figure out how to populate a table from a JSON array. So far, I can populate my table cells perfectly fine by using the following code:
self.countries = [[NSArray alloc] initWithObjects:@"Argentina", @"China", @"Russia", nil];
Concerning the JSON, I can successfully retrieve one line of text at a time and display it in a label. My goal is to populate an entire table view from a JSON array. I tried using the following code, but it still won't populate my table. Obviously I'm doing something wrong, but I searched everywhere and still can't figure it out:
NSURL *url = [NSURL URLWithString:@"http://BlahBlahBlah.com/CountryList"];
NSURLRequest *request = [NSURLRequest requestWithURL:url];
AFJSONRequestOperation *operation = [AFJSONRequestOperation JSONRequestOperationWithRequest:request success:^(NSURLRequest *request, NSHTTPURLResponse *response, id JSON)
{
    NSLog(@"%@", [JSON objectForKey:@"COUNTRIES"]);
    self.countries = [JSON objectForKey:@"COUNTRIES"];
}
failure:nil];
[operation start];
I am positive that the data is being retrieved, because the NSLog outputs the text perfectly fine. But when I try setting my array equal to the JSON array, nothing happens. I know the code is probably wrong, but I think I'm on the right track. Your help would be much appreciated.
EDIT:
This is the text in the JSON file I'm using:
{
"COUNTRIES": ["Argentina", "China", "Russia",]
}
-Miles
It seems that you need some basic JSON parsing. If you only target iOS 5.0 and above devices, then you should use NSJSONSerialization. If you need to support earlier iOS versions, then I really recommend the open source JSONKit framework.
Having recommended the above, I myself almost always use the Sensible TableView framework to fetch all data from my web service and automatically display it on a table view. Saves me a ton of manual labor and makes app maintenance a breeze, so it's probably something to consider too. Good luck!
