Apache NiFi: Parse data with UpdateRecord Processor - arrays

I'm trying to parse some data in NiFi (1.7.1) using the UpdateRecord processor.
The original data are JSON files that I would like to convert to Avro, based on a schema.
The Avro conversion works, but during that conversion I also need to parse one array element from the JSON data into a different structure in Avro.
This is a sample data of the input json:
{ "geometry" : {
"coordinates" : [ [ 4.963087975800593, 45.76365595859971 ], [ 4.962874487781098, 45.76320922779652 ], [ 4.962815443439148, 45.763116079159374 ], [ 4.962744732112515, 45.763010484202866 ], [ 4.962096825239138, 45.762112721939246 ] ]} ...}
Its schema (specified in the Record Reader) is:
{ "type": "record",
"name": "features",
"fields": [
{
"name": "geometry",
"type": {
"type": "record",
"name": "geometry",
"fields": [
{
"name": "coordinatesJson",
"type": {
"type": "array",
"items": {
"type": "array",
"items": "double"
}
}
},
]
}
},
....
]
}
As you can see, coordinates is an array of arrays.
I need to convert that data to Avro based on this schema (specified in the Record Writer):
{
  "name": "outputdata",
  "type": "record",
  "fields": [
    {
      "name": "coordinatesAvro",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "coordinatesAvro",
          "fields": [
            { "name": "X", "type": "double" },
            { "name": "Y", "type": "double" }
          ]
        }
      }
    },
    .....
  ]
}
The problem is that I'm not able to map from coordinatesJson to coordinatesAvro using RecordPath functions.
I tried several mappings, like:
Property | Value
/coordinatesJson[0..-1]/X | /geometry/coordinatesAvro[*][0]
/coordinatesJson[0..-1]/Y | /geometry/coordinatesAvro[*][1]
It should be a pretty straightforward parsing step, but as I said, I've been going in circles trying to achieve this for a while.
Any help would be really appreciated.

When I run into something like that, I do the following:
1) Transform the JSON into JSON with the structure that I need (in your case: coordinatesAvro) using an ExecuteScript processor; see the sketch below. I have used ECMAScript because it lets you simply parse the JSON and work with the resulting objects (transform them).
2) ConvertJsonToAvro with one common schema (coordinatesAvro in your case) for the Reader and Writer.
It works very well and I have used it on big-data cases. This is one possible solution to your problem.
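As a rough illustration of the reshaping in step 1, here is a standalone Python sketch of the same logic (not the actual ExecuteScript body; the file names are placeholders, and the field names come from the sample JSON and the writer schema in the question):

import json

# Turn the nested [[x, y], ...] pairs into [{"X": x, "Y": y}, ...] records
# so the result matches the Avro writer schema from the question.
with open("feature.json") as f:                    # placeholder input file
    feature = json.load(f)

pairs = feature["geometry"]["coordinates"]
feature["coordinatesAvro"] = [{"X": x, "Y": y} for x, y in pairs]

with open("feature_transformed.json", "w") as f:   # placeholder output file
    json.dump(feature, f, indent=2)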

JSONSchema keyword "type" when encased inside "items" fails to validate

I'm trying to write a JSON validator to check files before runtime, but there's a really odd issue happening with using "type".
I know "type" is a reserved word, but JSON Schema doesn't have an issue with it as long as it doesn't have a value pairing, as I found in this other question: Key values of 'key' and 'type' in json schema.
The solution to their problem was encasing "type": { "type": "string" } inside "properties", and it does work. However, my implementation requires it to be inside an array. Here's the snippet of my code:
{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "method": {
      "type": "array",
      "items": {
        "type": {
          "type": "string"
        },
        "name": {
          "type": "string"
        },
        "provider": {
          "type": "array"
        }
      }
    }
  }
}
Oddly enough, VS Code doesn't have a problem with it when the file is isolated, but when it's included in the main code it doesn't like it and offers no solution. Regardless, validating it with Python yields:
...
raise exceptions.SchemaError.create_from(error)
jsonschema.exceptions.SchemaError: {'type': 'string'} is not valid under any of the given schemas
Failed validating 'anyOf' in metaschema['allOf'][1]['properties']['properties']['additionalProperties']['$dynamicRef']['allOf'][1]['properties']['items']['$dynamicRef']['allOf'][3]['properties']['type']:
{'anyOf': [{'$ref': '#/$defs/simpleTypes'},
{'items': {'$ref': '#/$defs/simpleTypes'},
'minItems': 1,
'type': 'array',
'uniqueItems': True}]}
On schema['properties']['method']['items']['type']:
{'type': 'string'}
What further confuses me is that https://www.jsonschemavalidator.net/ tells me:
Expected array or string for 'type', got StartObject. Path 'properties.method.items.type', line 8, position 17.
Yet JSON Schema Faker is able to generate a fake file without any problems. The generated fake JSON also returns the same error when validated with Python and JSONSchemaValidator.
I'm a beginner and any help or insight will be greatly appreciated, thanks for your time.
Edit: here's the snippet of the input data, as requested.
{
  ...
  "method": [
    {
      "type": "action",
      "name": "name of the chaos experiment to use here",
      "provider": [
      ]
    }
  ]
}
Arrays don't have properties; arrays have items. The schema as you have included it is not valid; are you sure you don't mean to have this?
{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "method": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {
            "type": "string"
          },
          "name": {
            "type": "string"
          },
          "provider": {
            "type": "array"
          }
        }
      }
    }
  }
}
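As a quick sanity check, the corrected schema can be validated with the jsonschema package in Python; this is only a sketch, and the sample document is made up to mirror the input snippet from the question:

from jsonschema import validate   # pip install jsonschema

schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "method": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "type": {"type": "string"},
                    "name": {"type": "string"},
                    "provider": {"type": "array"},
                },
            },
        }
    },
}

sample = {
    "method": [
        {"type": "action", "name": "some experiment", "provider": []}
    ]
}

# Raises SchemaError if the schema itself is invalid, ValidationError if the
# sample does not match; completes silently when everything is fine.
validate(instance=sample, schema=schema)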

How to apply patterns on array enums in JSON Schema

I have a simple JSON schema that looks like so (and works)
{
  "cols": {
    "type": "array",
    "items": {
      "type": "string",
      "enum": [
        "id",
        "name",
        "age",
        "affiliation",
        ""
      ]
    },
    "additionalProperties": false
  }
}
I would like the enum to be the values prescribed above + a decoration so that any of the following would be allowed
"enum" = [
"id",
"lower(name)",
"average(age)",
"distinct(affiliation)",
""
]
In other words, for cols
cols=id would be valid but no further decoration would be allowed around id
cols=name and cols=lower(name) would be valid
cols=age and cols=average(age) would be valid
cols=affiliation and cols=distinct(affiliation) would be valid
cols='' empty string would be valid
Specifying the decorations as patterns would be great so that they would be case-insensitive. For example, cols=lower(name) and cols=LOWER(name) would both be ok.
You can change your enumerated list in enum to a list of patterns under anyOf:
"items": {
  "type": "string",
  "anyOf": [
    { "pattern": "^cols\b...the rest of your pattern here...$" },
    { etc... }
  ]
}
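JSON Schema patterns have no case-insensitivity flag, so one workaround is to spell the decorations with character classes inside anyOf. Here is a Python sketch using the jsonschema package (the exact pattern list is my guess at the decorations you want to allow):

from jsonschema import validate, ValidationError   # pip install jsonschema

cols_schema = {
    "type": "array",
    "items": {
        "type": "string",
        "anyOf": [
            # plain column names and the empty string
            {"enum": ["id", "name", "age", "affiliation", ""]},
            # decorated forms, case-insensitive via character classes
            {"pattern": "^[Ll][Oo][Ww][Ee][Rr]\\(name\\)$"},
            {"pattern": "^[Aa][Vv][Ee][Rr][Aa][Gg][Ee]\\(age\\)$"},
            {"pattern": "^[Dd][Ii][Ss][Tt][Ii][Nn][Cc][Tt]\\(affiliation\\)$"},
        ],
    },
}

for cols in [["id"], ["LOWER(name)"], ["average(age)"], ["lower(id)"]]:
    try:
        validate(instance=cols, schema=cols_schema)
        print(cols, "is valid")
    except ValidationError:
        print(cols, "is rejected")   # e.g. ["lower(id)"] is rejected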

JMESPath query for nested array structures

I have the following data structure as a result of aws logs get-query-results:
{
  "status": "Complete",
  "statistics": {
    "recordsMatched": 2.0,
    "recordsScanned": 13281.0,
    "bytesScanned": 7526096.0
  },
  "results": [
    [
      { "field": "time", "value": "2019-01-31T21:53:01.136Z" },
      { "field": "requestId", "value": "a9c233f7-0b1b-3326-9b0f-eba428e4572c" },
      { "field": "logLevel", "value": "INFO" },
      { "field": "callerId", "value": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf" }
    ],
    [
      { "field": "time", "value": "2019-01-25T13:13:01.062Z" },
      { "field": "requestId", "value": "a4332628-1b9b-a9c2-0feb-0cd4a3f7cb63" },
      { "field": "logLevel", "value": "INFO" },
      { "field": "callerId", "value": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf" }
    ]
  ]
}
The AWS CLI supports the JMESPath language for filtering output. I need to apply a query string to filter, among the returned "results", the objects that contain "callerId" as a "field", retrieve the "value" property, and obtain the following output:
[
  {
    "callerId": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  },
  {
    "callerId": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  }
]
The first step I take is to flatten the results array with the query string: results[]
This gets rid of the other root properties (status, statistics) and returns one big array with all of the {field: ..., value: ...}-like objects. But after this I can't manage to properly filter for the objects that match field=="callerId". I tried, among others, the following expressions without success:
'results[][?field=="callerId"]'
'results[][*][?field=="callerId"]'
'results[].{ callerId: #[?field=="callerId"].value }'
I'm not an expert in JMESPath and I was doing the tutorials of the jmespath.org site but couldn't manage to make it work.
Thanks!
Using jq is a good option because it's a more complete language, but if you want to do it with JMESPath, here is the solution:
results[*][?field=='callerId'].{callerId: value}[]
to get:
[
  {
    "callerId": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  },
  {
    "callerId": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  }
]
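If you want to experiment with the expression outside the CLI, the jmespath Python package evaluates the same syntax; this is just a sketch, with the sample output saved to a placeholder file:

import json
import jmespath   # pip install jmespath

with open("query_results.json") as f:   # placeholder file holding the get-query-results output
    data = json.load(f)

expression = "results[*][?field=='callerId'].{callerId: value}[]"
print(jmespath.search(expression, data))
# -> a flat list of {"callerId": ...} dicts, one per log record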
I'm not able to reproduce this fully since I don't have the same logs in my log stream, but I was able to do it using jq, putting the sample JSON object in a file:
cat sample_output.json | jq '.results[][] | select(.field=="callerId") | .value'
OUTPUT:
"a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
"a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
You could pipe the output from the AWS CLI to jq.
I was able to get pretty close with a native JMESPath query, using the built-in editor on this site:
http://jmespath.org/examples.html#filtering-and-selecting-nested-data
results[*][?field==`callerId`][]
OUTPUT:
[
  {
    "field": "callerId",
    "value": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  },
  {
    "field": "callerId",
    "value": "a9b0f9c2-eb42-3986-33f7-8e450b1b72cf"
  }
]
but I'm not sure how to get callerId to be the key, with its value taken from the other field.

Swagger array of strings without name

Currently I am trying to create a swagger file for my software.
Now I would like to create a definition for a timeRange.
My problem is that this array looks like this:
timeRange: {
"2016-01-15T09:00:00.000Z", // this is the start date
"2017-01-15T09:00:00.000Z" // this is the end date
}
How can I create an example value that works out of the box?
It is an "array of strings" with a minimum of two.
"timeRange": {
"type": "array",
"items": {
"type": "string",
"example": "2017-01-15T09:00:00.000Z,2017-01-15T09:00:00.000Z"
}
}
This generates an example like this:
"timeRange": [
"2017-01-15T09:00:00.000Z,2017-01-15T09:00:00.000Z"
]
This example does not work, because it is an array and not an object.
All together:
How can I produce an example value that consists of two different strings (without names)?
Hope you can help me!
Cheers!
timeRange: {
"2016-01-15T09:00:00.000Z", // this is the start date
"2017-01-15T09:00:00.000Z" // this is the end date
}
is not valid JSON – "timeRange" needs to be enclosed in quotes, and the object/array syntax should be different.
If using the object syntax {}, the values need to be named properties:
"timeRange": {
"start_date": "2016-01-15T09:00:00.000Z",
"end_date": "2017-01-15T09:00:00.000Z"
}
Otherwise timeRange needs to be an [] array:
"timeRange": [
"2016-01-15T09:00:00.000Z",
"2017-01-15T09:00:00.000Z"
]
In the first example ({} object), your Swagger would look as follows, with a separate example for each named property:
"timeRange": {
"type": "object",
"properties": {
"start_date": {
"type": "string",
"format": "date-time",
"example": "2016-01-15T09:00:00.000Z"
},
"end_date": {
"type": "string",
"format": "date-time",
"example": "2017-01-15T09:00:00.000Z"
}
},
"required": ["start_date", "end_date"]
}
In case of an [] array, you can specify an array-level example that is a multi-item array:
"timeRange": {
"type": "array",
"items": {
"type": "string",
"format": "date-time"
},
"example": [
"2016-01-15T09:00:00.000Z",
"2017-01-15T09:00:00.000Z"
]
}
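If you also need to enforce the two-item minimum mentioned in the question, the array form can additionally set "minItems": 2 (and "maxItems": 2 if exactly two values are required); both keywords are standard JSON Schema constraints supported by Swagger.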

"There is no index available for this selector" despite the fact I made one

In my data, I have two fields that I want to use as an index together. They are sensorid (any string) and timestamp (yyyy-mm-dd hh:mm:ss).
So I made an index for these two using the Cloudant index generator. This was created successfully and it appears as a design document.
{
  "index": {
    "fields": [
      {
        "name": "sensorid",
        "type": "string"
      },
      {
        "name": "timestamp",
        "type": "string"
      }
    ]
  },
  "type": "text"
}
However, when I try to make the following query to find all documents with a timestamp newer than some value, I am told there is no index available for the selector:
{
  "selector": {
    "timestamp": {
      "$gt": "2015-10-13 16:00:00"
    }
  },
  "fields": [
    "_id",
    "_rev"
  ],
  "sort": [
    {
      "_id": "asc"
    }
  ]
}
What have I done wrong?
It seems to me that Cloudant Query only allows sorting on fields that are part of the selector.
Therefore your selector should include the _id field and look like this:
"selector":{
"_id":{
"$gt":0
},
"timestamp":{
"$gt":"2015-10-13 16:00:00"
}
}
I hope this works for you!
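If you want to try the adjusted query programmatically, here is a rough Python sketch that POSTs it to the standard _find endpoint with the requests package (the account, database name, and credentials are placeholders):

import requests   # pip install requests

url = "https://ACCOUNT.cloudant.com/DATABASE/_find"   # placeholder account and database
query = {
    "selector": {
        "_id": {"$gt": 0},
        "timestamp": {"$gt": "2015-10-13 16:00:00"}
    },
    "fields": ["_id", "_rev"],
    "sort": [{"_id": "asc"}]
}

response = requests.post(url, json=query, auth=("USERNAME", "PASSWORD"))
response.raise_for_status()
for doc in response.json()["docs"]:
    print(doc)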
