Camel enrich SQL syntax issue - apache-camel

I'm tasked with creating a Camel route, using Camel version 2.20.0, that takes a line in from a CSV file, uses a value from that line in the SQL statement's WHERE clause, and merges and outputs the results again. If I hardcode the identifier in the SQL statement it works fine; if I try to use a dynamic URI I get an error.
The route is:
from("file:///tmp?fileName=test.csv")
.split()
.tokenize("\n")
.streaming()
.parallelProcessing(true)
.setHeader("userID", constant("1001"))
//.enrich("sql:select emplid,name from employees where emplid = '1001'",
.enrich("sql:select name from employees where emplid = :#userID",
new AggregationStrategy() {
public Exchange aggregate(Exchange oldExchange,
Exchange newExchange) {...
As I said, if I uncomment the line with the hardcoded 1001 it queries the DB and works as expected. However, using the ':#userID' syntax I get an Oracle error of:
java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
Message History
---------------------------------------------------------------------------------------------------------------------------------------
RouteId ProcessorId Processor Elapsed (ms)
[route3 ] [route3 ] [file:///tmp?fileName=test.csv ] [ 43]
[route3 ] [log5 ] [log ] [ 2]
[route3 ] [setHeader2 ] [setHeader[userID] ] [ 0]
[route3 ] [enrich2 ] [enrich[constant{sql:select name from employees where emplid = :#userID] [ 40]
The table is clearly there because it works when the value is hardcoded, so it must be something to do with passing in the dynamic value. I've tried many variations on how to pass that variable in (inside single quotes, using values from the body instead of headers, etc.) and haven't found a working combination yet, though I've seen lots of similar, seemingly working examples.
I've turned trace on and it appears the header is correctly set as well:
o.a.camel.processor.interceptor.Tracer : >>> (route3) setHeader[userID, 1001] --> enrich[constant{sql:select name from employees where emplid = :#userID}] <<< Pattern:InOnly, Headers:{CamelFileAbsolute=true, CamelFileAbsolutePath=/tmp/test.csv, CamelFileLastModified=1513116018000, CamelFileLength=26, CamelFileName=test.csv, CamelFileNameConsumed=test.csv, CamelFileNameOnly=test.csv, CamelFileParent=/tmp, CamelFilePath=/tmp/test.csv, CamelFileRelativePath=test.csv, userID=1001}, BodyType:String, Body:1001,SomeValue,MoreValues
What needs to change to make this work?
I should also note I've tried this approach, using various syntax options to refer to the header value, without any luck:
.enrich().simple("sql:select * from employees where emplid = :#${in.header.userID}").aggregate ...

From the docs:
From Camel 2.16 onwards both enrich and pollEnrich supports dynamic endpoints that uses an Expression to compute the uri, which allows to use data from the current Exchange. In other words all what is told above no longer apply and it just works.
As you are using 2.20, I think you may try this example:
from("file:///tmp?fileName=test.csv")
.split()
.tokenize("\n")
.streaming()
.parallelProcessing(true)
.setHeader("userID", constant("1001"))
//.enrich("sql:select emplid,name from employees where emplid = '1001'",
.enrich("sql:select name from employees where emplid = ':#${in.header.userID}'",
new AggregationStrategy() {
public Exchange aggregate(Exchange oldExchange,
Exchange newExchange) {...
Take a look at the Expression topic in docs for further examples.
To sum up, the expression could be:
"sql:select name from employees where emplid = ':#${in.header.userID}'"
EDIT:
Sorry, I missed the :# prefix. You can see a working unit test here.
Just take care with the column types. If it's an integer, you shouldn't need the quotes.
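A minimal sketch of that case, assuming emplid is a numeric column (the aggregation strategy is elided):
    .enrich("sql:select name from employees where emplid = :#${in.header.userID}",
        new AggregationStrategy() { ... })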
Cheers!

From the Camel docs:
pollEnrich or enrich does not access any data from the current
Exchange which means when polling it cannot use any of the existing
headers you may have set on the Exchange.
The recommended way of achieving what you want is to instead use the recipientList, so I suggest you read up on that (a rough sketch follows the links below).
Content Enricher
Recipient List
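For reference, a rough sketch of that pre-2.16 workaround, adapted from the OP's route; the merging of results that enrich would do with its AggregationStrategy is left out here:
from("file:///tmp?fileName=test.csv")
    .split().tokenize("\n").streaming()
    .setHeader("userID", constant("1001"))
    // recipientList evaluates the simple expression per exchange,
    // so the header can be used to build the SQL endpoint URI dynamically
    .recipientList(simple("sql:select name from employees where emplid = '${header.userID}'"));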
Edit:
As Ricardo Zanini rightly pointed out in his answer, it is actually possible to achieve this with Camel versions from 2.16 onwards. As the OP is using 2.20, my answer is invalid.
I will, however, keep my answer, but want to point out that it is only relevant if you're using a version older than 2.16.

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write a CSV, a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
file_name_suffix='.gz',
coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
'our_project:namespace.TableName',
schema='SCHEMA_AUTODETECT',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
custom_gcs_temp_location='gs://our-storage-bucket/tmp',
temp_file_format='NEWLINE_DELIMITED_JSON',
ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
        ],
        columns=["userId", "predictions", "feature1", "feature2"]
    )

Spring Data Mongo: query by example(QBE) does not work for records without "_class" property

If I have records in MongoDB without a "_class" property, query by example does not work. My database is populated by a third-party non-Java microservice, by the way.
Example:
{
"_id":"5ec3f00d98326d4c0ead815f",
"first_name":"firstName",
"last_name":"lastName"
}
Then MongoRepository.findAll(Example<S> example) is not able to find that record. If I add the correct "_class" field manually, everything works as expected.
Has anyone solved this issue?
Spring Data Mongo v3.0.0.RC1
OK, there is an UntypedExampleMatcher that must be used in this case:
ExampleMatcher matcher = UntypedExampleMatcher.matching()
    .withIgnoreNullValues()
    .withIgnoreCase();
Entity probe = ...
Example<Entity> entityExample = Example.of(probe, matcher);
entityRepo.findAll(entityExample);
But at first this approach did not work for some reason. It was very long-running, and ended with an exception at the end.
UPDATE:
Because of an incorrect probe, my request tried to fetch thousands of records from the DB, and therefore ended with an exception. After fixing the probe, searching with the QBE approach works like a charm.
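For reference, a minimal sketch of what a working probe might look like; the entity class, repository and field mappings below are assumptions for illustration, not the actual classes from this project:
import org.springframework.data.annotation.Id;
import org.springframework.data.domain.Example;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.core.mapping.Field;
import org.springframework.data.mongodb.core.query.UntypedExampleMatcher;
import java.util.List;

// hypothetical document class mapped to the snake_case fields shown above
@Document("people")
public class Person {
    @Id
    private String id;
    @Field("first_name")
    private String firstName;
    @Field("last_name")
    private String lastName;
    // getters/setters omitted
}

// probe with at least one non-null field, so the example doesn't match every record
Person probe = new Person();
probe.setFirstName("firstName");

// personRepo is assumed to be a MongoRepository<Person, String>
Example<Person> example = Example.of(probe,
        UntypedExampleMatcher.matching().withIgnoreNullValues().withIgnoreCase());
List<Person> matches = personRepo.findAll(example);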

Apache Camel: How to use "done" files to identify that writing records into a file is over and it can be moved

As the title suggests, I want to move a file into a different folder after I am done writing DB records to it.
I have already looked into several questions related to this: Apache camel file with doneFileName
But my problem is a little different since I am using split, streaming and parallelProcessing for getting the DB records and writing to a file. I am not able to figure out when and how to create the done file alongside the parallelProcessing. Here is the code snippet:
My route to fetch records and write it to a file:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregation>).streaming().parallelProcessing()
        .<some processors>
        .aggregate(header(Exchange.FILE_NAME), (o, n) -> {
            <file aggregation>
            return o;
        }).completionInterval(<some time interval>)
            .toD("file://<to the temp file>")
        .end()
    .end()
    .to("file:" + <path to temp folder> + "?doneFileName=${file:header." + Exchange.FILE_NAME + "}.done"); // this line is just for trying out done filename
In my aggregation strategy for the splitter I have code that basically counts records processed and prepares the response that would be sent back to the caller.
And in the other aggregate, outside the splitter, I have code for aggregating the DB rows and then writing them into the file.
And here is the file listener for moving the file:
from("file://<path to temp folder>?delete=true&include=<filename>.*.TXT&doneFileName=done")
.to(file://<final filename with path>?fileExist=Append);
Doing something like this is giving me this error:
Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot store file: <folder-path>/filename.TXT] org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: <folder-path>/filename.TXT
at org.apache.camel.component.file.FileOperations.storeFile(FileOperations.java:292)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.writeFile(GenericFileProducer.java:277)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.processExchange(GenericFileProducer.java:165)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.process(GenericFileProducer.java:79)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:121)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.sendToConsumers(SedaConsumer.java:298)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.doRun(SedaConsumer.java:207)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.run(SedaConsumer.java:154)[209:org.apache.camel.camel-core:2.16.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_144]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_144]
Caused by: org.apache.camel.InvalidPayloadException: No body available of type: java.io.InputStream but has value: Total number of records discovered: 5
What am I doing wrong? Any inputs will help.
PS: Newly introduced to Apache Camel
I would guess that the error comes from .toD("file://<to the temp file>") trying to write a file but finding the wrong type of body (the String "Total number of records discovered: 5" instead of an InputStream).
I don't understand why you have one file destination inside the splitter and one outside of it.
As @claus-ibsen suggested, try to remove the extra .aggregate(...) in your route. To split and re-aggregate it is sufficient to reference the aggregation strategy in the splitter. Claus also pointed to an example in the Camel docs:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregationStrategy>)
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends split-and-aggregate. From here
    // you get the re-aggregated result of the splitter.
    // So you can simply write it to a file and also write the done-file
    .to(...);
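For illustration, that final to(...) could be a file endpoint that also drops a done file once the aggregated body has been written; the folder name below is a placeholder, not a tested value:
    // producer side: write the payload and then a matching .done marker file
    .to("file://<path to final folder>?doneFileName=${file:name}.done");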
However, if you need to control the aggregation sizes, you have to combine the splitter with an aggregator. That would look something like this:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    // No aggregationStrategy here so it is a standard splitter
    .split(body())
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends the split. From here
    // you still get individual records from the splitter.
    .to("seda:aggregate");

// new route to do the controlled aggregation
from("seda:aggregate")
    // constant(true) is the correlation predicate => collect all messages in 1 aggregation
    .aggregate(constant(true), new YourAggregationStrategy())
        .completionSize(500)
    // not sure if this 'end' is needed
    .end()
    // write files with 500 aggregated records here
    .to("...");

Parsing data from Kafka in Apache Flink

I am just getting started with Apache Flink (Scala API); my issue is the following:
I am trying to stream data from Kafka into Apache Flink based on one example from the Flink site:
val stream =
env.addSource(new FlinkKafkaConsumer09("testing", new SimpleStringSchema() , properties))
Everything works correctly; the stream.print() statement displays the following on the screen:
2018-05-16 10:22:44 AM|1|11|-71.16|40.27
I would like to use a case class in order to load the data. I've tried using
flatMap(p=>p.split("|"))
but it's only splitting the data one character at a time.
Basically, the expected result is to be able to populate the 5 fields of the case class as follows:
field(0)=2018-05-16 10:22:44 AM
field(1)=1
field(2)=11
field(3)=-71.16
field(4)=40.27
but it's now doing:
field(0) = 2
field(1) = 0
field(3) = 1
field(4) = 8
etc...
Any advice would be greatly appreciated.
Thank you in advance
Frank
The problem is the usage of String.split. If you call it with a String, the method expects it to be a regular expression, and | is a special character (alternation) in regular expressions. Thus, p.split("\\|") would be the correct pattern for your input data. Alternatively, you can call the split variant where you specify the separating character: p.split('|'). Both solutions should give you the desired result.

JMeter: How to count JSON objects in an Array using jsonpath

In JMeter I want to check the number of objects in a JSON array, which I receive from the server.
For example, on a certain request I expect an array with 5 objects.
[{...},{...},{...},{...},{...}]
After reading this: count members with jsonpath?, I tried using the following JSON Path Assertion:
JSON Path: $
Expected value: hasSize(5)
Validate against expected value = checked
However, this doesn't seem to work properly. When I actually do receive 5 objects in the array, the response assertion says it doesn't match.
What am I doing wrong?
Or how else can I do this?
Although the JSONPath Extractor doesn't provide a hasSize function, it can still be done.
Given the example JSON from the answer by PMD UBIK-INGENIERIE, you can get the number of matches on the book array in at least 2 ways:
1. Easiest (but fragile) way - using Regular Expression Extractor.
As you can see, there are 4 entries for category, like:
{ "category": "reference",
{ "category": "fiction",
...
If you add a Regular Expression Extractor configured as follows:
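An equivalent text form of that configuration (the Reference Name matches is implied by the variable used below; the exact regular expression is an assumption):
    Reference Name: matches
    Regular Expression: "category"
    Template: $0$
    Match No.: -1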
It'll capture all the category entries and expose the number of matches, so you will be able to use the ${matches_matchNr} variable wherever required.
This approach is straightforward and easy to implement, but it's very vulnerable to any changes in the response format. If you expect the JSON data to change in the foreseeable future, continue with the next option.
2. Harder (but more stable) way - calling JsonPath methods from a Beanshell PostProcessor
JMeter has a Beanshell scripting extension mechanism which has access to all variables/properties in scope, as well as to the underlying JMeter and third-party dependency APIs. In this case you can call the JsonPath library (which is under the hood of the JSONPath Extractor) directly from a Beanshell PostProcessor.
import com.jayway.jsonpath.Criteria;
import com.jayway.jsonpath.Filter;
import com.jayway.jsonpath.JsonPath;
import java.util.ArrayList;
import java.util.List;

// 'data' is the parent sampler's response, pre-defined by the Beanshell PostProcessor
Object json = new String(data);
List categories = new ArrayList();
categories.add("fiction");
categories.add("reference");
Filter filter = Filter.filter(Criteria.where("category").in(categories));
List books = JsonPath.read(json, "$.store.book[?]", new Filter[] {filter});
// expose the count as a JMeter variable
vars.put("JSON_ARRAY_SIZE", String.valueOf(books.size()));
The code above evaluates the JSONPath expression $.store.book[?] against the parent sampler response, counts the matches and stores the count in the ${JSON_ARRAY_SIZE} JMeter variable, which can later be reused in an If Controller or an assertion.
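For instance, assuming you expect exactly five objects, a Response Assertion applied to the JMeter Variable JSON_ARRAY_SIZE with the pattern 5 would do, or an If Controller with a condition (interpreted as JavaScript) like:
    "${JSON_ARRAY_SIZE}" == "5"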
References:
JMeter – Working with JSON – Extract JSON response
JMeter's User Manual Regular Expressions entry
JSON Path Documentation and Examples
How to use BeanShell: JMeter's favorite built-in component
This is not possible with the plugin you are using (JMeter-plugins).
But it can be done with the JSON Extractor available since JMeter 3.0; this plugin was donated by UbikLoadPack (http://jmeter.apache.org/changes_history.html).
Example:
Say you have this JSON that contains an array of books:
{ "store": {"book": [
{ "category": "reference","author": "Nigel Rees","title": "Sayings of the Century","price": 8.95},
{ "category": "fiction","author": "Evelyn Waugh","title": "Sword of Honour","price": 12.99},
{ "category": "fiction","author": "Herman Melville","title": "Moby Dick","isbn": "0-553-21311-3","price": 8.99},
{ "category": "fiction","author": "J. R. R. Tolkien","title": "The Lord of the Rings","isbn": "0-395-19395-8","price": 22.99}
],
"bicycle": {"color": "red","price": 19.95}} }
To get this count:
1/ Add a JSON Extractor:
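The screenshot isn't reproduced here; the JSON Extractor settings would be along these lines (the variable name bookTitle comes from the original answer, the JSON Path expression is an assumption):
    Names of created variables: bookTitle
    JSON Path expressions: $.store.book[*].title
    Match No. (0 for Random): -1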
The count will then be available in bookTitle_matchNr, which you can access through:
${bookTitle_matchNr}
Running this Test Plan would show the result: the sampler named Debug Sampler-${bookTitle_matchNr} is displayed as Debug Sampler-4, confirming 4 matches.
