Azure Stream Analytics–Querying JSON Arrays of arrays - arrays

I have a problem writing a query to extract a table out of the arrays from a json file:
The problem is how to get the information of the array “data packets” and its contents of arrays and then make them all in a normal sql table.
One hard issue there is the "CrashNotification" and "CrashMaxModuleAccelerations", I dont know how to define and use them.
The file looks like this:
{ "imei": { "imei": "351631044527130F", "imeiNotEncoded":
"351631044527130"
},
"dataPackets": [ [ "CrashNotification", { "version": 1, "id": 28 } ], [
"CrashMaxModuleAccelerations", { "version": 1, "module": [ -1243, -626,
14048 ] } ] ]}
I tried to use Get array elements method and other ways but I am never able to access 2nd level arrays like elements of "CrashNotification" of the "dataPackets" or elements of "module" of the array "CrashMaxModuleAccelerations" of the "dataPackets".
I looked also here (Select the first element in a JSON array in Microsoft stream analytics query) and it doesnt work.
I would appreciate any help :)

Based on your schema, here's an example of query that will extract a table with the following columns: emei, crashNotification_version, crashNotification_id
WITH Datapackets AS
(
SELECT imei.imei as imei,
GetArrayElement(Datapackets, 0) as CrashNotification
FROM input
)
SELECT
imei,
GetRecordPropertyValue (GetArrayElement(CrashNotification, 1), 'version') as crashNotification_version,
GetRecordPropertyValue (GetArrayElement(CrashNotification, 1), 'id') as crashNotification_id
FROM Datapackets
Let me know if you have any further question.
Thanks,
JS (Azure Stream Analytics)

We built a HTTP API called Stride for converting streaming JSON data into realtime, incrementally updated tables using only SQL.
All you'd need to do is write raw JSON data to the Stride API's /collect endpoint, define continuous SQL queries via the /process endpoint, and then push or pull data via the /analyze endpoint.
This approach eliminates the need to deal with any underlying data infrastructure and gives you a SQL-based approach to this type of streaming analytics problem.

Related

Azure Data Factory - converting lookup result array

I'm pretty new to Acure Data Factory - ADF and have stumbled into somthing I would have solved with a couple lines of code.
Background
Main flow:
Lookup Activity fetchin an array of ID's to process
ForEach Activity looping over input array and uisng a Copy Activity pulling data from a REST API storing it into a database
Step #1 would result in an array containing ID's
{
"count": 10000,
"value": [
{
"id": "799128160"
},
{
"id": "817379102"
},
{
"id": "859061172"
},
... many more...
Step #2 When the lookup returns a lot of ID's - individual REST calls takes a lot of time. The REST API supports batching ID's using a comma spearated input.
The question
How can I convert the array from the input into a new array with comma separated fields? This will reduce the number of Activities and reduce the time to run.
Expecting something like this;
{
"count": 1000,
"value": [
{
"ids": "799128160,817379102,859061172,...."
},
{
"ids": "n,n,n,n,n,n,n,n,n,n,n,n,...."
}
... many more...
EDIT 1 - 19th Des 22
Using "Until Activity" and keeping track of posistions, I managed to use plain ADF. Would be nice if this could have been done using some simple array manipulation in a code snippet.
The ideal response might be we have to do manipulation with Dataflow -
My sample input:
First, I took a Dataflow In that adding a key Generate (Surrogate key) after the source - Say new key field is 'SrcKey'
Data preview of Surrogate key 1
Add an aggregate where you group by mod(SrcKey/3). This will group similar remainders into the same bucket.
Add a collect column in the same aggregator to collect into an array with expression trim(toString(collect(id)),'[]').
Data preview of Aggregate 1
Store output in single file in blob storage.
OUTPUT

Read a Struct JSON with AWS Glue that is on a single line

I have this JSON on a Bucket that has been crawled with a classifier that splits arrays into record with this JSON classifier $[*].
I noticed that the JSON is on a single line - nothing wrong with syntax - but this results in the table being created having a single column of type array containing a struct which contains the actual fields I need.
In Athena I wasn't able to access the data and Glue was not able to read the columns as in array.field; so I manually changed the structure of the table making a single struct type with the other fields inside. This I'm able to query on Athena and get the Glue wizard to recognise the single columns as part of the struct.
When I create the job and map the fields accordingly (this is what is automatically generated, note the array.field notation applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("array.col1", "long", "col1", "long"), ("array.col2", "string", "col2", "string"), ("array.col3", "string", "col3", "string")], transformation_ctx = "applymapping1")) I test the output on a table in an S3 Bucket. The Job does not fail at all, BUT creates files in the Bucket that are empty!
Another thing I've tried is to modify the Source JSON and add return lines:
this is before:
[{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}]
this is after:
[
{"col1":322,"col2":299,"col3":1613552400000,"col4":"TEST","col5":"TEST"},
{"col1":2,"col2":0,"col3":1613552400000,"col4":"TEST","col5":"TEST"}
]
Having the file modified as stated before lets me correctly read and write data; this led me to believe that the problem is having a bad JSON at the beginning. Before asking to change the JSON is there something I can implement in my Glue Job (Spark 2.4, Python 3) to handle a JSON on a single line? I've searched everywhere but found nothing.
The end goal is to load data into Redshift, we're working S3 to S3 to check on why data isn't being read.
Thanks in advance for your time and consideration.

How to parse a specific data from a JSON string in SnowFlake?

I am very new to SnowFlake and I am trying to work on a dataset. The column I am interested in has multiple feedbacks combined into one in the JSON format and I want to dig only the relevant key. Here's the snapshot of lets say Column_X:
Looking for a way to parse this data in such a way that I have a new column like "riskIndicator" and "riskIndicator" with values 27, 74 as two new rows. I am attempting to parse like the code below but that's not working. Had a look at the javascript/UDF approach but looks complicated for this piece.
,get_path(parse_json("riskIndicatorLNInstantID"),'riskCode') as riskIndicator
I will be thankful for any kind of help/suggestion here.
Thank you.
So if the problem you are having is breaking up the json, you will want to use FLATTEN
with data as (
select parse_json('[{"description":"unable to paste json", "riskCode":"27","seq":1},{"description":"typing in json is painful", "riskCode":"74","seq":2}]') as json
)
select d.json
,f.value:riskCode as riskIndicator
from data d
,lateral flatten(input=>d.json) f;
gives:
JSON RISKINDICATOR
[{ "description": "unable to paste j... "27"
[{ "description": "unable to paste j... "74"
Lateral flatten can help extract the fields of a JSON object and is a very good alternative to extracting them one by one using the respective names. However, sometimes the JSON object can be nested and normally extracting those nested objects requires knowing their names
Docs Reference: https://community.snowflake.com/s/article/Dynamically-extract-multi-level-JSON-object-using-lateral-flatten

Get database last N data points from each node (Cloudant/couchdb)

TL;DR: MapReduce or POST request?
What is the correct(=most efficient) way to fetch the latest n data points of multiple sensors, from Cloudant or equivalent database?
Sensor data is stored in individual documents like this:
{
"_id": "2d26dbd8e655ae02bdab611afc92b6cf",
"_rev": "1-a64448521f05935b915e4bee12328e84",
"date": "2017-06-20T15:59:50.509Z",
"name": "Sensor01",
"temperature": 24.5,
"humidity": 45.3,
"rssi": -33
}
I want the fetch the latest 10 documents from sensor01-sensor99 so I can feed it to UI.
I have discovered few options:
1. Use map reduce function
Reduce each sensor data to array under sensor01, sensor02, etc...
E.g.
Map:
function (doc) {
if (doc.name && doc.temperature) emit(doc.name, doc.temperature);
}
Reduce:
function (keys, values, rereduce) {
var temp_arr=[];
for (i=0;i<values.length;i++)
{
temp_arr.push(values);
}
return temp_arr;
}
I couldn't get this to work, but I think the method should be viable.
2. Multi-document fetching
{
"queries":[
{sensor01},{sensor02},{sensor03} etc....
]};
Where each {sensor0x} is filtered using
{"startkey": [sensors[i],{}],"endkey": [sensors[i]],"limit": 5}
This way I can order documents using ?descending=true
I implemented it and it works. I have my doubts should I use this if I have 1000 sensors with 10000 data points each.
And for hundreds of sensors I need to send a very large POST request.
Something better?
Is my architecture even correct?
Storing sensor data individual documents, and then fill the UI by fetching all data through REST API.
Thank you very much!
There's nothing wrong with your method of storing one reading per document, but there's no truly efficient way of getting "the last n data points" for a number of sensors.
We could create a MapReduce function:
function (doc) {
if (doc.name && doc.temperature && doc.date) {
emit([doc.name, doc.date], doc.temperature);
}
}
This creates an indexed ordered on name and date.
We can access the most recent readings for a single sensor by querying the view:
/_design/mydesigndoc/_view/myview?startkey=["Sensor01","2020-01-01"]&descending=true&limit=10
This fetches readings for "Sensor01" in newest-first order:
startkey & endkey are reveresed when doing descending=true
descending= true means in reverese order
limit - the number of readings required (or n in your parlance)
This is a very efficient use of Cloudant/CouchDB but it only returns the last n readings for single sensor. To retrieve other sensors' data, additional API calls would be required.
Creating an index like this:
function (doc) {
if (doc.name && doc.temperature && doc.date) {
emit(doc.date, doc.temperature);
}
}
orders each reading by date. You can then retrieve the newest n readings with:
/_design/mydesigndoc/_view/myview?startkey="2020-01-01"&descending=true&limit=200
If all of your sensors are saving data at the same rate, then simply using a larger limit should get your the latest readings of all sensors.
This too is an efficient use of CouchDB/Cloudant.
You may also want to look at the built-in reducers (_count, _sum and _stats) to get the database to aggregate readings for you. They are a great way to create year/month/day groupings of IoT data.
In general, I would recommend not using custom reducers they are many times more inefficient than the built-in reducers which are written in Erlang.

Generalized way to extract JSON from a relational database?

Ok, maybe this is too broad for StackOverflow, but is there a good, generalized way to assemble data in relational tables into hierarchical JSON?
For example, let's say we have a "customers" table and an "orders" table. I want the output to look like this:
{
"customers": [
{
"customerId": 123,
"name": "Bob",
"orders": [
{
"orderId": 456,
"product": "chair",
"price": 100
},
{
"orderId": 789,
"product": "desk",
"price": 200
}
]
},
{
"customerId": 999,
"name": "Fred",
"orders": []
}
]
}
I'd rather not have to write a lot of procedural code to loop through the main table and fetch orders a few at a time and attach them. It'll be painfully slow.
The database I'm using is MS SQL Server, but I'll need to do the same thing with MySQL soon. I'm using Java and JDBC for access. If either of these databases had some magic way of assembling these records server-side it would be ideal.
How do people migrate from relational databases to JSON databases like MongoDB?
Here is a useful set of functions for converting relational data to JSON and XML and from JSON back to tables: https://www.simple-talk.com/sql/t-sql-programming/consuming-json-strings-in-sql-server/
SQL Server 2016 is finally catching up and adding support for JSON.
The JSON support still does not match other products such as PostgreSQL, e.g. no JSON-specific data type is included. However, several useful T-SQL language elements were added that make working with JSON a breeze.
E.g. in the following Transact-SQL code a text variable containing a JSON string is defined:
DECLARE #json NVARCHAR(4000)
SET #json =
N'{
"info":{
"type":1,
"address":{
"town":"Bristol",
"county":"Avon",
"country":"England"
},
"tags":["Sport", "Water polo"]
},
"type":"Basic"
}'
and then, you can extract values and objects from JSON text using the JSON_VALUE and JSON_QUERY functions:
SELECT
JSON_VALUE(#json, '$.type') as type,
JSON_VALUE(#json, '$.info.address.town') as town,
JSON_QUERY(#json, '$.info.tags') as tags
Furhtermore, the OPENJSON function allows to return elements from referenced JSON array:
SELECT value
FROM OPENJSON(#json, '$.info.tags')
Last but not least, there is a FOR JSON clause that can format a SQL result set as JSON text:
SELECT object_id, name
FROM sys.tables
FOR JSON PATH
Some references:
https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server
https://learn.microsoft.com/en-us/sql/relational-databases/json/convert-json-data-to-rows-and-columns-with-openjson-sql-server
https://blogs.technet.microsoft.com/dataplatforminsider/2016/01/05/json-in-sql-server-2016-part-1-of-4/
https://www.red-gate.com/simple-talk/sql/learn-sql-server/json-support-in-sql-server-2016/
I think one 'generalized' solution will be as follows:-
Create a 'select' query which will join all the required tables to fetch results in a 2 dimentional array (like CSV / temporary table, etc)
If each row of this join is unique, and the MongoDB schema and the columns have one to one mapping, then its all about importing this CSV/Table using MongoImport command with required parameters.
But a case like above, where a given Customer ID can have an array of 'orders', needs some computation before mongoImport.
You will have to write a program which can 'vertical merge' the orders for a given customer ID.For small set of data, a simple java program will work. But for larger sets, parallel programming using spark can do this job.
SQL Server 2016 now supports reading JSON in much the same way as it has supported XML for many years. Using OPENJSON to query directly and JSON datatype to store.
There is no generalized way because SQL Server doesn’t support JSON as its datatype. You’ll have to create your own “generalized way” for this.
Check out this article. There are good examples there on how to manipulate sql server data to JSON format.
https://www.simple-talk.com/blogs/2013/03/26/sql-server-json-to-table-and-table-to-json/

Resources