Parsing data from Kafka in Apache Flink

I am just getting started with Apache Flink (Scala API). My issue is the following:
I am trying to stream data from Kafka into Apache Flink, based on an example from the Flink site:
val stream =
env.addSource(new FlinkKafkaConsumer09("testing", new SimpleStringSchema(), properties))
Everything works correctly; the stream.print() statement displays the following on the screen:
2018-05-16 10:22:44 AM|1|11|-71.16|40.27
I would like to use a case class to load the data. I've tried using
flatMap(p=>p.split("|"))
but it only splits the data one character at a time.
Basically, the expected result is to populate the 5 fields of the case class as follows:
field(0)=2018-05-16 10:22:44 AM
field(1)=1
field(2)=11
field(3)=-71.16
field(4)=40.27
but it is currently doing:
field(0) = 2
field(1) = 0
field(2) = 1
field(3) = 8
etc...
Any advice would be greatly appreciated.
Thank you in advance
Frank

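The problem is the usage of String.split. If you call it with a String, the method treats that argument as a regular expression. In a regular expression an unescaped | is the alternation operator, so the pattern "|" matches the empty string between every pair of characters, which is why your input is split one character at a time. Thus, p.split("\\|") would be the correct regular expression for your input data. Alternatively, you can call the split variant that takes the separating character, p.split('|'). Both solutions should give you the desired result.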
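The difference can be seen in a minimal Java sketch (Scala's p.split(String) calls the same java.lang.String method). The class and method names here are illustrative, not from the original post. Plain Java has no split overload that takes a char, so Pattern.quote is used as a stand-in for the character variant.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplitDemo {
    // Unescaped "|" is an empty alternation, so it matches between every character.
    static String[] splitNaive(String line) {
        return line.split("|");
    }

    // Escaping the pipe makes the regex match a literal '|'.
    static String[] splitEscaped(String line) {
        return line.split("\\|");
    }

    // Pattern.quote builds a literal pattern without manual escaping.
    static String[] splitQuoted(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String line = "2018-05-16 10:22:44 AM|1|11|-71.16|40.27";
        System.out.println(splitNaive(line).length + " pieces from the unescaped pattern");
        System.out.println(Arrays.toString(splitEscaped(line)));
        System.out.println(Arrays.toString(splitQuoted(line)));
    }
}
```

Once the line is split into five strings, each one can be converted and passed to the case class constructor. Note that this would normally happen in a map rather than a flatMap, since flatMap flattens the array back into individual string elements.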

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?

(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
1. Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
2. Add a StatisticsGen and a SchemaGen component to the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
3. Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
# Import paths as in the TFX releases contemporary with this answer; adjust for your version.
from tfx.components import BulkInferrer, SchemaGen, StatisticsGen
from tfx.proto import bulk_inferrer_pb2
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it to a CSV file, a BigQuery table, or anything else from there. It certainly helped us in ZenML with our BatchInferencePipeline.

Answering my own question here to document what we did, even though I think @Hamza Tahir's answer is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple of key parameters, and don't have the time to get this to a real public / production state.

I'm a little late to this party but this is some code I use for this task:
from typing import List, Text
import pandas as pd
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)
    # one row per prediction log record
    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )

Appending values to DataSet in Apache Flink

I am currently writing a (simple) analysis program to sum time-connected power readings. Since the data is presumably raw (e.g. disturbances from the measuring device have not been calculated out), I have to account for disturbances by calculating the mean of the first one thousand samples. The calculation of the mean itself is not a problem; I am only unsure how to generate the appropriate DataSet.
For now it looks roughly like this:
DataSet<Tupel2<long,double>>Gyrotron_1=ECRH.includeFields('11000000000'); // obviously the line to declare the first gyrotron, continues for the next ten lines, assuming separattion of not occupied space
DataSet<Tupel2<long,double>>Gyrotron_2=ECRH.includeFields('10100000000');
DataSet<Tupel2<long,double>>Gyrotron_3=ECRH.includeFields('10010000000');
DataSet<Tupel2<long,double>>Gyrotron_4=ECRH.includeFields('10001000000');
DataSet<Tupel2<long,double>>Gyrotron_5=ECRH.includeFields('10000100000');
DataSet<Tupel2<long,double>>Gyrotron_6=ECRH.includeFields('10000010000');
DataSet<Tupel2<long,double>>Gyrotron_7=ECRH.includeFields('10000001000');
DataSet<Tupel2<long,double>>Gyrotron_8=ECRH.includeFields('10000000100');
DataSet<Tupel2<long,double>>Gyrotron_9=ECRH.includeFields('10000000010');
DataSet<Tupel2<long,double>>Gyrotron_10=ECRH.includeFields('10000000001');
for (int=1,i<=10;i++) {
DataSet<double> offset=Gyroton_'+i+'.groupBy(1).first(1000).sum()/1000;
}
It's the part in the for-loop I'm unsure of. Does anybody know if it is possible to append values to DataSets and, if so, how?
If in doubt, I could always put the values into an array, but I do not know if that is the wise thing to do.

This code will not work for many reasons. I'd recommend looking into the fundamentals of Java and its basic data structures, and also of Flink.
It's really hard to understand what you are actually trying to achieve, but this is the closest I could come up with:
String[] codes = { "11000000000", ..., "10000000001" };
DataSet<Tuple2<Long, Double>> result = env.fromElements();
for (final String code : codes) {
    DataSet<Tuple2<Long, Double>> codeResult = ECRH.includeFields(code)
        .groupBy(1)
        .first(1000)
        .sum(0)
        .map(sum -> new Tuple2<>(sum.f0, sum.f1 / 1000d));
    result = codeResult.union(result);
}
result.print();
But please take the time to understand the basics before delving deeper. I also recommend using an IDE like IntelliJ, which would point out at least 6 issues in your code.
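Independent of Flink, the computation the question describes (one offset per channel, taken as the mean of the first thousand samples and collected into an array) can be sketched in plain Java. This is only an illustration under assumptions: the class name, the method names, and the double[][] layout (one row per gyrotron channel) are hypothetical. It does not use the DataSet API.

```java
import java.util.Arrays;

class OffsetDemo {
    // Mean of the first `window` samples of one channel (all samples if fewer exist).
    static double offset(double[] samples, int window) {
        int n = Math.min(window, samples.length);
        if (n <= 0) {
            throw new IllegalArgumentException("need at least one sample and a positive window");
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += samples[i];
        }
        return sum / n;
    }

    // One offset per channel, collected into an array in channel order.
    static double[] offsets(double[][] channels, int window) {
        double[] result = new double[channels.length];
        for (int c = 0; c < channels.length; c++) {
            result[c] = offset(channels[c], window);
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny made-up readings; real data would use a window of 1000.
        double[][] channels = { {1.0, 2.0, 3.0, 4.0}, {10.0, 10.0} };
        System.out.println(Arrays.toString(offsets(channels, 2)));
    }
}
```

For data small enough to fit in memory, collecting the offsets into an array like this is a reasonable choice. The union-based Flink version above is what would keep the work distributed.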

Use the content of a tuple as a session variable

I extracted a list of tuples from a previous response with the following regex:
.check(regex(""""idSc":(.{1,8}),"pasTemps":."codePasTemps":(.),"""").ofType[(String,String)].findAll.saveAs ("OBJECTS1"))
So I get my object:
OBJECTS1 -> List((1657751,2), (1658105,2), (4557378,2), (1657750,1), (916,1), (917,2), (1658068,1), (1658069,2), (4557379,2), (1658082,1), (4557367,1), (4557368,1), (1660865,2), (1660866,2), (1658122,1), (921,1), (922,2), (923,2), (1660875,1), (1660876,2), (1660877,2), (1658300,1), (1658301,1), (1658302,1), (1658309,1), (1658310,1), (2996562,1), (4638455,1))
After that I run a foreach and need to extract each couple to add them to the next requests, so we tried:
.foreach("${OBJECTS1}", "couple") {
  exec(http("request_foreach47")
    .get("/ctr/web/api/seriegraph/bydates/${couple(0)}/${couple(1)}/1552863600000/1554191743799")
    .headers(headers_27))
}
But I get the message: named 'couple' does not support index access
I also thought that using 2 regexes on the couple to extract both parts could work, but I haven't found any way to apply a regex to a session variable. (Even if it's not needed in this case, if it is possible I'm really interested to learn how, as it could be useful.)
I would be really thankful if you could help me. (I'm using Gatling 2 and can't use a more recent version, as it's for work and other scripts have been developed with Gatling 2.)

Each "couple" is a Scala tuple, which can't be indexed into like a collection. Fortunately, the Gatling EL has a function that handles tuples.
So instead of
.get("/ctr/web/api/seriegraph/bydates/${couple(0)}/${couple(1)}/1552863600000/1554191743799")
you can use
.get("/ctr/web/api/seriegraph/bydates/${couple._1}/${couple._2}/1552863600000/1554191743799")

Tone analyser only returns analysis for 1 sentence

When using tone analyser, I am only able to retrieve 1 result. For example, if I use the following input text.
string m_StringToAnalyse = "The World Rocks ! I Love Everything !! Bananas are awesome! Old King Cole was a merry old soul!";
The results only contain the analysis for the document level and for sentence_id = 0, i.e. "The World Rocks !". The analysis for the next 3 sentences is not returned.
Any idea what I am doing wrong, or am I missing something? This is the case when running the provided sample code as well.
string m_StringToAnalyse = "This service enables people to discover and understand, and revise the impact of tone in their content. It uses linguistic analysis to detect and interpret emotional, social, and language cues found in text.";
Running tone analysis using the sample code on the sample text above also returns results for the document and the first sentence only.
I have tried with versions "2016-02-19" as well as "2017-03-15" with same results.

I believe that if you want sentence by sentence analysis you need to send every separate sentence as a JSON object. It will then return analysis in an array where id=SENTENCE_NUM.
Here is an example of one I did using multiple YouTube comments (using Python):
import json
import requests
from watson_developer_cloud import ToneAnalyzerV3  # package name of the older Watson Python SDK
def get_comments(video):
    # Get the comments from the YouTube API using requests
    url = 'https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&maxResults=100&videoId='+ video +'&key=' + youtube_credentials['api_key']
    r = requests.get(url)
    comment_dict = list()
    # for item in comments, add an object to the list with the text of the comment
    for item in r.json()['items']:
        the_comment = {"text": item['snippet']['topLevelComment']['snippet']['textOriginal']}
        comment_dict.append(the_comment)
    # return the list as JSON to the sentiment_analysis function
    return json.dumps(comment_dict)
def sentiment_analysis(words):
    # Load Watson Credentials using Python SDK
    tone_analyzer = ToneAnalyzerV3(
        username=watson_credentials['username'], password=watson_credentials['password'], version='2016-02-11')
    # Get the tone, based on the JSON object that is passed to sentiment_analysis
    return_sentiment = json.dumps(tone_analyzer.tone(text=words), indent=2)
    return_sentiment = json.loads(return_sentiment)
    # hand the parsed result back to the caller
    return return_sentiment
Afterwards you can do whatever you want with the JSON object. I would also like to note, for anyone else looking at this: if you want to analyse many objects, you can pass sentences=False to the tone_analyzer.tone function.

Parsing Solr Results - javabin format

I am trying to integrate Solr with Java using SolrJ. The results retrieved are in the following format:
{
numFound=3,
start=0,
docs=[
SolrDocument{
id=IW-02,
name=iPod&iPodMiniUSB2.0Cable,
manu=Belkin,
manu_id_s=belkin,
cat=[
electronics,
connector
],
features=[
carpoweradapterforiPod,
white
],
weight=2.0,
price=11.5,
price_c=11.50,
USD,
popularity=1,
inStock=false,
store=37.7752,
-122.4232,
manufacturedate_dt=TueFeb1418: 55: 59EST2006,
_version_=1452625905160552448
}
Now this is the javabin format. How do I extract results from it? I have heard that SolrJ converts the results to objects by itself, but I can't figure out how.
Thanks for the help in advance.

Let solrReply be the response object. Then you can access different parts of the result using the appropriate keys. Say you want docs, you can do:
docs = solrReply['docs']
if you want the first result you could do:
first = solrReply['docs'][0]
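Within a result you can access each field in the same way. In SolrJ (Java) itself, the equivalent is to call getResults() on the QueryResponse, which returns a SolrDocumentList, and then read individual fields with getFieldValue("name") on each SolrDocument.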
