Multiple sorting in ArangoDB - graph-databases

My webapp needs to display several sorted lists of document attributes in a graph. These are hours, cycles, and age.
I have an AQL query that beautifully traverses the graph and gets me all the data my app needs in 2 ms. I'm very impressed! But I need it sorted for each graph. The query currently returns an array of json objects that contain all three of the attributes and the id for which they apply. Awesome. The query also very easily sorts on one of the attributes.
My problem is: I need to have a sorted list of all three, and would prefer not to query the database three times since the data is all in the same documents my traversal returned.
I would like to return three sorted arrays of json objects: one containing hours and the id, one containing cycles and the id, and one containing age and the id. This way, my app can display all three graphs without client-side sorting.
Although the database is very fast, the HTTP requests themselves are time consuming, which is why I'd like to pull all three lists in one request; the data itself is small.
My current query is a simple graph traversal:
for v, e, p in outbound startNode graph 'myGraph'
filters & definitions...
sort v.hours desc
return {"hours": v.hours, "cycles": v.cycles, "age": v.age, "id": v.id}
Is there an easy way I can tell Arango to return me this structure?
{
[
{
"id": 47,
"hours": 123
},
{
"id": 23,
"hours": 105
}...
],
[
{
"id": 47,
"cycles": 18
},
{
"id": 23,
"cycles": 5
}...
],
[
{
"id": 47,
"age": 4.2
},
{
"id": 23,
"age": 0.9
}
]
}
Although the traversal is fast, I would prefer if I didn't have to re-traverse the graph three times to do it, if possible.

My solution:
let data = (for v, e, p in outbound startNode graph 'myGraph'
filters & definitions...
return {"hours": v.hours, "cycles": v.cycles, "age": v.age, "id": v.id})
let byHours = (for thing in data
sort thing.hours desc
return {"hours": thing.hours, "id": thing.id})
let byCycles = (for thing in data
sort thing.cycles desc
return {"cycles": thing.cycles, "id": thing.id})
let byAge = (for thing in data
sort thing.age desc
return {"age": thing.age, "id": thing.id})
return {"hours": byHours, "cycles": byCycles, "age": byAge}

I'm not sure how this compares against your solution performance-wise, but the most obvious solution would be to traverse once and then create three sorted results like this:
LET nodes = (
FOR v, e, p IN OUTBOUND startNode GRAPH 'myGraph'
FILTER ...
RETURN v
)
RETURN {
hours: (
FOR n IN nodes
SORT n.hours DESC
RETURN KEEP(n, ['hours', 'id'])
),
cycles: (
FOR n IN nodes
SORT n.cycles DESC
RETURN KEEP(n, ['cycles', 'id'])
),
age: (
FOR n IN nodes
SORT n.age DESC
RETURN KEEP(n, ['age', 'id'])
)
}
This would traverse the graph only once but sort the result three times.
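To make the "traverse once, sort three times" shape concrete, here is a minimal Python sketch of the same logic applied to an in-memory list. It is only an illustration, not a replacement for the AQL: the `nodes` list stands in for the traversal result, the document with id 11 is invented sample data, and `sorted_view` is a hypothetical helper name.

```python
from operator import itemgetter

# Hypothetical stand-in for the documents the single traversal returns.
nodes = [
    {"id": 47, "hours": 123, "cycles": 18, "age": 4.2},
    {"id": 23, "hours": 105, "cycles": 5, "age": 0.9},
    {"id": 11, "hours": 110, "cycles": 20, "age": 2.5},  # invented sample row
]

def sorted_view(docs, attribute):
    """Sort docs by one attribute, descending, keeping only that attribute and the id."""
    ordered = sorted(docs, key=itemgetter(attribute), reverse=True)
    return [{attribute: d[attribute], "id": d["id"]} for d in ordered]

# One pass to collect the data, then one sort per attribute.
result = {attr: sorted_view(nodes, attr) for attr in ("hours", "cycles", "age")}
```

Each entry of `result` mirrors one of the sub-queries in the AQL above, with the comprehension playing the role of KEEP.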

Related

DynamoDB Maps vs Lists for storing IoT Data

I'm trying to store IoT Data from data loggers that can have a variety of sensors attached, below is an example. Each logger sends an MQTT message every 20 seconds
"state": {
"reported": {
"batv": 5105,
"ts": 1614595073655,
"temp": 20,
"humidity": 50
}
}
My question is about storing these MQTT messages/readings efficiently in a DynamoDB table: should I store the readings in a Map containing Maps, like this? (Note: this is what I'm currently doing, and when the number of readings gets large, the item is very slow to load in the AWS DynamoDB console.)
{
"readings": {
"ts1614592810955": {
"battery_level": 5089,
"temp": 20,
"humidity": 50
},
"ts1614593692395": {
"battery_level": 5093,
"temp": 20,
"humidity": 50
}
},
"serial_number": "TDG_logger_thing"
}
The alternative, which I'm leaning towards, is to store the readings in a list:
{
"readings": [
{
"batv": 5105,
"ts": 1614594313407,
"temp": 20,
"humidity": 50
},
{
"batv": 5105,
"ts": 1614594313555,
"temp": 20,
"humidity": 50
}
],
"serial_number": "TDG_Logger_Thing"
}
Does anyone with knowledge of DynamoDB or of storing IoT data have any suggestions? It would be greatly appreciated.
(BTW The flow of data is)
Data Logger -> AWS IoT -> AWS Lambda -> DynamoDB
DDB List operations can be a limiting factor when you have use cases like trying to reliably modify attributes held in the List
Example - List
In a List, to set temp to 30 where ts = 1614594313407, you would need to fetch the List from DDB, traverse it until you find the object with ts = 1614594313407, set temp to 30, then write the whole List back to DDB. This read-modify-write cycle is not atomic.
[
{
"batv": 5105,
"ts": 1614594313407,
"temp": 20,
"humidity": 50
},
{
"batv": 5105,
"ts": 1614594313555,
"temp": 20,
"humidity": 50
}
]
Example - Map
With a Map, you can update the value of temp to 30 where ts = ts1614592810955 in a single update "SET readings.#ts_id.temp = :temp_val" reliably
{
"readings": {
"ts1614592810955": {
"battery_level": 5089,
"temp": 20,
"humidity": 50
},
"ts1614593692395": {
"battery_level": 5093,
"temp": 20,
"humidity": 50
}
},
"serial_number": "TDG_logger_thing"
}
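As a rough sketch of how that single update could be assembled, the Python function below builds the request parameters without calling AWS. The update expression is taken verbatim from the sentence above. The parameter names match those accepted by DynamoDB's UpdateItem operation, but treating `serial_number` as the table's partition key is an assumption, and `build_temp_update` is a hypothetical helper name.

```python
def build_temp_update(serial_number, ts_key, new_temp):
    """Build keyword arguments for an UpdateItem call that sets one nested temp value."""
    return {
        # Assumption: serial_number is the table's partition key.
        "Key": {"serial_number": serial_number},
        # Expression copied from the answer; #ts_id is a placeholder for the map key.
        "UpdateExpression": "SET readings.#ts_id.temp = :temp_val",
        "ExpressionAttributeNames": {"#ts_id": ts_key},
        "ExpressionAttributeValues": {":temp_val": new_temp},
    }

params = build_temp_update("TDG_logger_thing", "ts1614592810955", 30)
```

With boto3's Table resource, these parameters would typically be passed as `table.update_item(**params)`; only the addressed reading is touched, so the rest of the map never has to be read back.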
I would not use a map or a list. Instead, split the readings and store each one as a separate item, using the same partition key (such as the device id) combined with a sort key per reading that includes the timestamp. That way you can more easily query for all temp data, and with the timestamp in the sort key you can fetch only the measurements from a specific period.
so primary key would be:
PK[device id] - SK[Measurement type - Data time] : (Attributes per measurement)
After that you can store whatever data you need for each individual measurement, and you can quickly update and retrieve individual measurements. Hope it helps.
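The key layout above can be sketched in plain Python, emulating the table with a list so that nothing here touches AWS. The attribute names `PK`, `SK` and `value`, the `#` separator, and the helpers `make_item` and `query_range` are all assumptions made for illustration; the timestamp is zero-padded to 13 digits so that string comparison on the sort key agrees with numeric order.

```python
def make_item(device_id, measurement, ts_ms, value):
    """One item per reading: partition key = device, sort key = measurement type + timestamp."""
    return {"PK": device_id, "SK": f"{measurement}#{ts_ms:013d}", "value": value}

def query_range(items, device_id, measurement, start_ms, end_ms):
    """Emulate a query with PK equality and a BETWEEN condition on the sort key."""
    low = f"{measurement}#{start_ms:013d}"
    high = f"{measurement}#{end_ms:013d}"
    hits = [i for i in items if i["PK"] == device_id and low <= i["SK"] <= high]
    return sorted(hits, key=lambda i: i["SK"])

# In-memory stand-in for the table, using readings from the question.
table = [
    make_item("TDG_logger_thing", "temp", 1614592810955, 20),
    make_item("TDG_logger_thing", "temp", 1614593692395, 20),
    make_item("TDG_logger_thing", "humidity", 1614592810955, 50),
    make_item("TDG_logger_thing", "batv", 1614592810955, 5089),
]

# Only temp readings inside the requested time window come back.
temps = query_range(table, "TDG_logger_thing", "temp", 1614592000000, 1614593000000)
```

Putting the measurement type before the timestamp groups each sensor's readings together within a device, which is what makes the per-type time-range lookup possible.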

dataweave filter and maxBy on nested array list

I have a list of students and their marks for their respective subjects. I want to filter all students of a specific grade and then find the student who got the maximum marks in a specific subject.
[
{
"name": "User 01",
"grade": 1,
"schoolName": "school01",
"marks": {
"english": 10,
"math": 30,
"social": 30
}
},
{
"name": "User 02",
"grade": 1,
"schoolName": "school02",
"marks": {
"english": 10,
"math": 20,
"social": 30
}
}
]
I am able to perform both operations independently. Can someone help me find the student object who got the max marks in math in a specific grade?
If I understand your requirement correctly this script does it. Just change the variables grade and topic to the specific values you are interested in.
Generally speaking, it is always better to provide example outputs and whatever script you have so far, in addition to the input samples, so the context is easier to understand.
%dw 2.0
output application/json
var grade = 1
var topic = "math"
---
flatten(
payload map (alumn, order) ->
(alumn.marks pluck ((value, key, index) ->
{
name: alumn.name,
grade: alumn.grade,
result:value,
topic: key
})
)
) // restructure the list to one result per element
filter ((item, index) -> (item.grade == grade) and ((item.topic as String) == topic)) // filter by grade and topic
maxBy ((item) -> item.result) // get the maximum result
I used the script below to achieve it.
%dw 2.0
output application/json
var grade = 1
var topic = "math"
---
payload filter (
((item, index) -> item.grade == grade)
) maxBy ($.marks.math as String {format: "000000"})
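For cross-checking the expected result outside of DataWeave, the same filter-then-maximum logic can be written as a short Python sketch. This is only an illustration of the technique, using the sample input from the question; `top_student` is a hypothetical helper name.

```python
# Sample input copied from the question.
students = [
    {"name": "User 01", "grade": 1, "schoolName": "school01",
     "marks": {"english": 10, "math": 30, "social": 30}},
    {"name": "User 02", "grade": 1, "schoolName": "school02",
     "marks": {"english": 10, "math": 20, "social": 30}},
]

def top_student(records, grade, topic):
    """Return the record with the highest mark in topic among the given grade, or None."""
    in_grade = [r for r in records if r["grade"] == grade]          # filter step
    return max(in_grade, key=lambda r: r["marks"][topic], default=None)  # maxBy step

best = top_student(students, grade=1, topic="math")
```

Here `best` is the whole student object, which matches what the question asks for; the `default=None` covers a grade with no students.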

Elasticsearch query array field across documents

I want to query an array field in Elasticsearch. I have an array field that contains one or several node numbers of a GPU that were allocated to a job. Different people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were used at a specific time.
Say I have three rows of data which fall in the same time interval. I want to plot a histogram showing that there are three nodes occupied in that period. Can I achieve this in Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can play with the interval such that it's something that makes sense for you. If you run through the example above, you'll only see one histogram bucket, however if you change hour to minute, you'll see the histogram build itself out with more data points.
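To see what the two aggregations are expected to report for the sample rows, the bucketing can be reproduced in a few lines of Python. This sketch does not talk to Elasticsearch; it only mimics "group by time bucket, then count distinct nodes", and `distinct_nodes_per_bucket` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime

# The three events indexed in the example above.
events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]

def distinct_nodes_per_bucket(docs, bucket_format):
    """Group docs into time buckets and count the distinct node numbers in each."""
    buckets = defaultdict(set)
    for doc in docs:
        ts = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        # Truncating the timestamp via the format string defines the bucket.
        buckets[ts.strftime(bucket_format)].update(doc["nodes"])
    return {key: len(nodes) for key, nodes in sorted(buckets.items())}

per_hour = distinct_nodes_per_bucket(events, "%Y-%m-%d %H:00")
per_minute = distinct_nodes_per_bucket(events, "%Y-%m-%d %H:%M")
```

With hourly buckets all three events fall together and yield 3 distinct nodes; with minute buckets each event gets its own bucket, which is the effect of changing hour to minute described in the note above.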

How can I sort based on children?

Using Vapor and Fluent (PostgreSQL if that matters) I have entity B that has aID: Node (A is B's parent) to reference A and A has a one-to-many relationship with B. How can I make a query to fetch all A's sorted by the count of B's?
I want the result to look something like this:
All A's in DB
[
{
"id": 4,
"name": "Hi",
"bCount": 1000
},
{
"id": 3,
"name": "Another",
"bCount": 800
},
{
"id": 5,
"name": "Test",
"bCount": 30
}
]
First,
Create a model for A
Decode that JSON string into an array of A
Once you have done this, sorting (largest bCount first, as in your example) becomes as easy as:
array.sort { $0.bCount > $1.bCount }
This is going to be tricky to implement entirely in Fluent using Entity. Firstly, you will need to use raw SQL to get your bCount. Secondly, you will need to change your init(node:) to accept bCount, though it shouldn't be in your makeNode() because we don't want to create a stored database field for it.
Try this for your raw SQL (untested):
SELECT
A.*,
(
SELECT COUNT(*)
FROM B
WHERE B.aID = A.id
) AS bCount
FROM A
ORDER BY bCount DESC
Then, run that query to get your A models.
var models: [A] = []
if let driver = drop.database?.driver as? PostgreSQLDriver {
if case .array(let array) = try driver.raw(sql) {
for result in array {
let model = try A(node: result)
models.append(model)
}
}
}
As I said before, your init method on A will be receiving bCount so you will need to store it there.
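Since the SQL above is marked untested, here is a small self-contained check of the correlated-subquery approach using Python's built-in sqlite3 module. SQLite stands in for PostgreSQL purely for convenience, the table contents are invented sample data, and the query sorts in descending order to match the desired output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE B (id INTEGER PRIMARY KEY, aID INTEGER REFERENCES A(id));
    INSERT INTO A (id, name) VALUES (3, 'Another'), (4, 'Hi'), (5, 'Test');
    -- Invented sample children: three for A 4, two for A 3, one for A 5.
    INSERT INTO B (aID) VALUES (4), (4), (4), (3), (3), (5);
""")

# Count each A's children in a correlated subquery, then sort by that count.
rows = conn.execute("""
    SELECT A.*,
           (SELECT COUNT(*) FROM B WHERE B.aID = A.id) AS bCount
    FROM A
    ORDER BY bCount DESC
""").fetchall()
conn.close()
```

Each row comes back as `(id, name, bCount)`, so the extra column is what the model initializer would need to accept.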

Elasticsearch constant score sort

I have a pretty simple elasticsearch query where I filter some items by category. It's a constant score query, something like this:
"query": {
"constant_score": {
"filter": {
"term": {
"category": "[category-id]"
}
}
}
}
The problem is that, with no score to sort the results by, they don't always come back in the same order. This is an issue because it messes up my pagination.
An example. I request the first 5 items and I receive back let's say the following ids: [4, 7, 8, 10, 3]. I then want the next 5 items to display the next page, but I may get some items repeated, like this: [12, 15, 7, 13, 9].
The problem is that all my fields are string fields, and I wouldn't want to sort by any of them. The sort order is not important, it's just important to keep the same order every time.
Any ideas? Thanks!
Try this:
GET _search
{
"query": {
"bool": {
"filter": {
"term": {
"category": "[category-id]"
}
}
}
}
}
Since this is what used to be known as a filtered query, no scores are calculated and the score field will have a value of 0.
