I am working with Cosmos DB and want to write a SQL query that returns multiple documents combined into a single document, with the related documents embedded.
To elaborate, imagine you have the following two document types in one container. Each OrderDetail document references its Order document through OrderId.
1. Order
{
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name": "ABC",
"Type": "Order",
"DeptName": "ABC",
"TotalAmount": 100.05
}
2. OrderDetail
{
"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 202,
"Quantity": 10,
"UnitPrice": 10.05
},
{
"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 200,
"Quantity": 11,
"UnitPrice": 15.05
}
I want to write a query that returns all OrderDetail entries in one array, based on the reference OrderId = "31d4c08b-ee59-4ede-b801-3cacaea38808".
The output should look like this:
{
"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order",
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
}
I have no idea how to write a Cosmos DB query that produces the above result.
The output you want is the result of a join across documents, which is what a relational database is designed for. Cosmos DB is a non-relational database (its JOIN works only within a single document), and to my knowledge no single SQL query can produce the above output directly.
I suggest running two queries. The first returns:
{"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order"}
The second returns:
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
Then combine the two results in your application. You could also do this inside a stored procedure.
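The combine step can be done client-side in a few lines. The following is a minimal Python sketch, assuming the two query results have already been fetched (for example through a Cosmos DB SDK). The function name embed_details and the two query strings are illustrative assumptions and are not taken from the question.

```python
# Hypothetical query texts for the two round trips; "c" is the container alias.
ORDER_QUERY = (
    "SELECT c.OrderId, c.Name, c.Type FROM c "
    "WHERE c.Type = 'Order' AND c.OrderId = @orderId"
)
DETAIL_QUERY = (
    "SELECT c.OrderDetailId, c.Type, c.ItemNo, c.Quantity, c.UnitPrice FROM c "
    "WHERE c.Type = 'OrderDetail' AND c.OrderId = @orderId"
)


def embed_details(order, details):
    """Return a copy of the order with the detail documents embedded as an array."""
    combined = dict(order)                   # do not mutate the caller's document
    combined["OrderDetail"] = list(details)  # array of OrderDetail documents
    return combined


# Sample results shaped like the output of the two queries above.
order = {
    "OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
    "Name": "ABC",
    "Type": "Order",
}
details = [
    {"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9", "Type": "OrderDetail",
     "ItemNo": 202, "Quantity": 10, "UnitPrice": 10.05},
    {"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9", "Type": "OrderDetail",
     "ItemNo": 200, "Quantity": 11, "UnitPrice": 15.05},
]

print(embed_details(order, details))
```

Both queries filter on the same OrderId, so if OrderId is the partition key they stay within one partition.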
Each document in my round collection embeds an array of hole objects, with one entry for each hole played and scored.
The structure looks like this (there are more fields, but this summarises):
{
"_id": {
"$oid": "60701a691c071256e4f0d0d6"
},
"schema": {
"$numberDecimal": "1.0"
},
"playerName": "T Woods",
"comp": {
"id": {
"$oid": "607019361c071256e4f0d0d5"
},
"name": "US Open",
"tees": "Pro Tees",
"roundNo": {
"$numberInt": "1"
},
"scoringMethod": "Stableford"
},
"holes": [
{
"holeNo": {
"$numberInt": "1"
},
"holePar": {
"$numberInt": "4"
},
"holeSI": {
"$numberInt": "3"
},
"holeGross": {
"$numberInt": "4"
},
"holeStrokes": {
"$numberInt": "1"
},
"holeNett": {
"$numberInt": "3"
},
"holeGrossPoints": {
"$numberInt": "2"
},
"holeNettPoints": {
"$numberInt": "3"
}
}
]
}
In the Atlas web UI, it shows as (note there are 9 holes in this particular round of golf - limited to 3 for brevity):
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf (i.e. a birdie on par 3 or better).
Being new to MongoDB, and NoSQL constructs, I am stuck with this. Reading around the aggregation pipeline framework, I have tried to break down the stages I will need as:
1. Filter by the comp.id and comp.roundNo
2. Filter this result for any hole within the holes array of objects
Maybe I have approached this wrong, and should filter or structure this pipeline differently?
So far, using the Atlas web UI, I can apply these filters individually as:
{
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"comp.roundNo": 2
}
And:
{ "holes.0.holeGross": 2 }
But I have 2 problems:
1. In the second filter query I have hard-coded the array index to get this value. I would need to search across all the sub-elements of every document that matches this comp.id && comp.roundNo.
2. How do I combine these? I presume this is where aggregation comes in, as well as enumerating across the whole array (as above).
I note in particular it is the extra ".0." part of the second query that I am not seeing from various other online postings trying to do the same thing. Is my data structure incorrect? Do I need the [0]...[17] Objects for an 18-hole round of golf?
"I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf"
If that is the goal, a simple $lte search inside the holes array like the following would do:
db.collection.find({ "holes.holeGross": { $lte: 2 } })
You simply leave the array index (such as 0) out of the property path, and the condition is then tested against each element of the array.
https://mongoplayground.net/p/KhZLnj9mJe5
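To combine this with the competition filter from the question, the conditions can go into a single filter document, where conditions on different fields are ANDed together, so no aggregation pipeline is needed for this case. Below is a minimal Python sketch. The helper names build_round_filter and has_low_gross are hypothetical, the second player is made-up sample data, and the pure-Python check only mimics the matching behaviour so that the example runs without a database.

```python
def build_round_filter(comp_id, round_no, max_gross=2):
    """Build one filter document; all top-level conditions must hold (implicit AND)."""
    return {
        "comp.id": comp_id,        # with a real driver this would be an ObjectId
        "comp.roundNo": round_no,
        # No array index in the path: a document matches if ANY hole satisfies it.
        "holes.holeGross": {"$lte": max_gross},
    }


def has_low_gross(round_doc, max_gross=2):
    """Pure-Python equivalent of the holes.holeGross condition, for illustration."""
    return any(hole["holeGross"] <= max_gross for hole in round_doc["holes"])


# Simplified sample rounds (plain integers instead of extended JSON).
rounds = [
    {"playerName": "T Woods",
     "holes": [{"holeNo": 1, "holeGross": 4}, {"holeNo": 2, "holeGross": 2}]},
    {"playerName": "A Player",  # made-up player for contrast
     "holes": [{"holeNo": 1, "holeGross": 5}, {"holeNo": 2, "holeGross": 3}]},
]

print([r["playerName"] for r in rounds if has_low_gross(r)])  # ['T Woods']
print(build_round_filter("607019361c071256e4f0d0d5", 2))
```

With pymongo, the dictionary returned by build_round_filter could be passed to collection.find(...), after wrapping the id string in bson.ObjectId.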
I'm trying to store IoT data from data loggers that can have a variety of sensors attached; an example message is below. Each logger sends an MQTT message every 20 seconds.
"state": {
"reported": {
"batv": 5105,
"ts": 1614595073655,
"temp": 20,
"humidity": 50
}
}
My question is how to store these MQTT messages/readings efficiently in a DynamoDB table. Should I store the readings in a Map containing Maps, like this? (This is what I'm currently doing, and when the number of readings gets large, the item is very slow to load in the AWS DynamoDB console.)
{
"readings": {
"ts1614592810955": {
"battery_level": 5089,
"temp": 20,
"humidity": 50
},
"ts1614593692395": {
"battery_level": 5093,
"temp": 20,
"humidity": 50
}
},
"serial_number": "TDG_logger_thing"
}
The alternative, which I'm leaning towards, is to store the readings in a List:
{
"readings": [
{
"batv": 5105,
"ts": 1614594313407,
"temp": 20,
"humidity": 50
},
{
"batv": 5105,
"ts": 1614594313555,
"temp": 20,
"humidity": 50
}
],
"serial_number": "TDG_Logger_Thing"
}
Does anyone with knowledge of DynamoDB or of storing IoT data have any suggestions? Any help is greatly appreciated.
(BTW, the flow of data is:)
Data Logger -> AWS IoT -> AWS Lambda -> DynamoDB
DDB List operations can be a limiting factor in use cases where you need to reliably modify attributes held in the List.
Example - List
In a List, to set temp to 30 where ts = 1614594313407, you would need to fetch the List from DDB, traverse the objects until you find the one with ts = 1614594313407, set temp to 30, then write the whole List back to DDB. That read-modify-write cycle is not quite transactional.
[
{
"batv": 5105,
"ts": 1614594313407,
"temp": 20,
"humidity": 50
},
{
"batv": 5105,
"ts": 1614594313555,
"temp": 20,
"humidity": 50
}
]
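The client-side part of that cycle might look like the sketch below. It works on a list that has already been fetched; the function name set_temp_in_list is hypothetical, and the DynamoDB read and write calls are deliberately left out.

```python
def set_temp_in_list(readings, ts, new_temp):
    """Scan the list for the reading with the given ts and update its temp.

    Returns the index that was changed, or None if no reading matched.
    """
    for index, reading in enumerate(readings):
        if reading["ts"] == ts:
            reading["temp"] = new_temp
            return index
    return None


# The list as it would look after being read from the item.
readings = [
    {"batv": 5105, "ts": 1614594313407, "temp": 20, "humidity": 50},
    {"batv": 5105, "ts": 1614594313555, "temp": 20, "humidity": 50},
]

changed = set_temp_in_list(readings, 1614594313407, 30)
print(changed, readings[0]["temp"])  # 0 30
```

Anything another writer appends between the read and the write-back would be lost unless the write is guarded in some way.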
Example - Map
With a Map, you can reliably set temp to 30 for the reading stored under the key ts1614592810955 in a single update: "SET readings.#ts_id.temp = :temp_val"
{
"readings": {
"ts1614592810955": {
"battery_level": 5089,
"temp": 20,
"humidity": 50
},
"ts1614593692395": {
"battery_level": 5093,
"temp": 20,
"humidity": 50
}
},
"serial_number": "TDG_logger_thing"
}
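As a rough sketch, the parameters for such an update could be assembled as below. The update expression is the one quoted above; the helper name build_temp_update and the assumption that serial_number is the table's partition key are mine, and no AWS call is made here.

```python
def build_temp_update(serial_number, ts_key, new_temp):
    """Build keyword arguments for a DynamoDB UpdateItem call on a nested map value."""
    return {
        # Assumption: serial_number is the partition key of the table.
        "Key": {"serial_number": serial_number},
        "UpdateExpression": "SET readings.#ts_id.temp = :temp_val",
        # The placeholder #ts_id is replaced by the map key of the reading.
        "ExpressionAttributeNames": {"#ts_id": ts_key},
        "ExpressionAttributeValues": {":temp_val": new_temp},
    }


params = build_temp_update("TDG_logger_thing", "ts1614592810955", 30)
print(params["UpdateExpression"])
print(params["ExpressionAttributeNames"])
```

With boto3, these could be passed to a Table resource as table.update_item(**params).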
I would not use a map or a list. Instead, I would split the readings and store them as separate items, all sharing the same partition key (such as the device id) and each with its own sort key that includes the timestamp. That way you can more easily query for all temp data, and with the timestamp in the sort key you can use a query to fetch only the measurements from a specific period.
So the primary key would be:
PK[device id] - SK[Measurement type - Date time] : (Attributes per measurement)
After that you can store whatever data you need for each individual measurement, and you can quickly update and retrieve individual measurements. Hope it helps.
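One way to picture this layout is the following sketch, which splits a single reported message into one item per measurement. The attribute names pk, sk and value, the "#" separator and the function name to_items are illustrative assumptions; the raw epoch-millisecond timestamp is used as the date-time part of the sort key.

```python
def to_items(device_id, reported):
    """Turn one reported message into one item per measurement.

    Item shape (attribute names are illustrative):
      pk = device id, sk = "<measurement type>#<timestamp>"
    """
    ts = reported["ts"]
    return [
        {"pk": device_id, "sk": f"{name}#{ts}", "value": value}
        for name, value in reported.items()
        if name != "ts"  # the timestamp goes into the sort key instead
    ]


reported = {"batv": 5105, "ts": 1614595073655, "temp": 20, "humidity": 50}
items = to_items("TDG_logger_thing", reported)
for item in items:
    print(item)
```

A query on pk equal to the device id with a sort key condition such as begins_with "temp#", or BETWEEN "temp#<from>" and "temp#<to>", would then return only the temp readings for a period. Epoch-millisecond timestamps compare correctly as strings as long as they all have the same number of digits.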
I want to query an array field in Elasticsearch. The array field contains one or more node numbers of the GPU nodes that were allocated to a job. Different people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were in use at a specific time.
Say I have three rows of data that fall in the same time interval. I want to plot a histogram showing that three nodes were occupied in that period. Can I achieve this in Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can adjust the interval to whatever makes sense for you. If you run through the example above, you'll see only one histogram bucket; if you change hour to minute, you'll see the histogram build itself out with more data points.
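To make the expected numbers concrete, here is a small pure-Python sketch of what the two aggregations compute conceptually for the three sample events: bucket by time, then count distinct node values per bucket. It does not talk to Elasticsearch, and the function name distinct_nodes_per_bucket is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# The three sample events indexed in the example above.
events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]


def distinct_nodes_per_bucket(events, bucket="hour"):
    """Group events into hour or minute buckets and count distinct node numbers."""
    seen = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        if bucket == "hour":
            key = ts.replace(minute=0, second=0)
        else:  # treat anything else as "minute"
            key = ts.replace(second=0)
        seen[key].update(event["nodes"])  # a set keeps each node only once
    return {key.isoformat(): len(nodes) for key, nodes in sorted(seen.items())}


print(distinct_nodes_per_bucket(events, "hour"))
# {'2018-06-10T00:00:00': 3}
print(distinct_nodes_per_bucket(events, "minute"))
# {'2018-06-10T00:00:00': 1, '2018-06-10T00:01:00': 3, '2018-06-10T00:02:00': 2}
```

The hourly bucket gives 3, which is the output asked for in the question.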
I am using Solr to retrieve results from a Cassandra table.
Table structure:
CREATE TABLE mytable (
field1 uuid,
field2 text,
bfield blob,
custmdata_ map<text, text>,
PRIMARY KEY (field1)
);
Table content
INSERT INTO mytable (field1, field2, custmdata_) VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'test1', { 'custmdata_data1' : 'data1value', 'custmdata_data2' : 'data2value' });
INSERT INTO mytable (field1, field2, custmdata_) VALUES (e26690db-dd54-4b61-b002-d3c07125f359, 'test2', { 'custmdata_data5' : 'data5value', 'custmdata_data1' : 'mydata1value' });
I am able to retrieve the results using a Solr query:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"custmdata_data1":"data1value",
"custmdata_data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"custmdata_data5":"data5value",
"custmdata_data1" : "mydata1value"
}
]
}
}
Is there any way to specify the field name in result so that I can retrieve the dynamic fields without having the field name prefix? I need result like this:
{
"responseHeader": {
"status": 0,
"QTime": 1
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"field1": "62c36092-82a1-3a00-93d1-46196ee77204",
"field2": "test1",
"data1":"data1value",
"data2" : "data2value"
},
{
"field1": "e26690db-dd54-4b61-b002-d3c07125f359",
"field2": "test2",
"data5":"data5value",
"data1" : "mydata1value"
}
]
}
}
Update:
From the DataStax documentation, I found that:
Avoid or limit the use of dynamic fields. Lucene allocates memory for
each unique field (column) name, so if you have a row with columns A,
B, C, and another row with B, D, E, Lucene allocates 5 chunks of
memory. For millions of rows, the heap is unwieldy.
So is there a better way to achieve dynamic field based filtering in Solr? What I need is to filter against custom fields that may vary for each insert.
Instead of calling your dynamic field custmdata_, call it data, and that should get rid of the bit you don't want.
Otherwise, removing the prefix from dynamic field names is not supported, but you can rename returned fields with fl field aliases: https://wiki.apache.org/solr/CommonQueryParameters#Field_alias
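For illustration, an fl value using aliases could be generated as in the sketch below. It assumes the dynamic field names are known up front, which may not hold in every setup, and the helper name build_fl is hypothetical. The alias:field form is the field alias syntax described on the page linked above.

```python
def build_fl(plain_fields, dynamic_fields, prefix="custmdata_"):
    """Build an fl value that returns dynamic fields under prefix-free aliases."""
    parts = list(plain_fields)
    for name in dynamic_fields:
        if name.startswith(prefix):
            alias = name[len(prefix):]
            parts.append(f"{alias}:{name}")  # <alias>:<real field name>
        else:
            parts.append(name)               # nothing to strip, return as-is
    return ",".join(parts)


fl = build_fl(["field1", "field2"], ["custmdata_data1", "custmdata_data2"])
print(fl)  # field1,field2,data1:custmdata_data1,data2:custmdata_data2
```

The resulting string would be sent as the fl parameter of the Solr query.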
If you're worried about having too many dynamic fields, try to work around it with a C* collection type, if your scenario lends itself to that.
In Case 4 of this page, the query searches for all chairs less than 70 units in height:
curl localhost:9200/example/product/_search -d '{
"query": {
"filtered": {
"query": {
"match": {
"name": "chair"
}
},
"filter": {
"numeric_range": {
"size.height": {
"lt": 70
}
}
}
}
}
}'
Result:
"hits": [
{
"_id": "0",
"_source": {
"product": "chair",
"size": [
{
"width": 50,
"height": 50,
"depth": 50
},
{
"width": 75,
"height": 75,
"depth": 75
}
]
}
}
]
1) Why is the ID 0 for both chair sizes?
2) Why does the response show dimensions for the other chair that is 75 units in height?
1) The writer wanted to show a 1-to-N relation. There are two types of chair in his repository (in this case): a chair with dimensions of 50 and a chair with dimensions of 75. Both of them are still chairs, and the id of a chair is 0.
2) Because by default ES doesn't return partial results; it returns documents. In our case we have a chair document with a size array that holds two objects: one for the 50 dimensions and one for the 75 dimensions. The supplied query can either select the whole document or not.
Translated into English, the query says: bring me all the documents that have the value "chair" in the product field and where at least one of the size.height values is lower than 70.
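The same "whole document or nothing" behaviour can be mimicked in a few lines of plain Python. This is only an illustration of the matching logic, not how ES evaluates the query internally; the second document is made up for contrast and does not come from the article.

```python
documents = [
    {"_id": "0", "product": "chair",
     "size": [{"width": 50, "height": 50, "depth": 50},
              {"width": 75, "height": 75, "depth": 75}]},
    # Made-up second document for contrast; not from the article.
    {"_id": "1", "product": "chair",
     "size": [{"width": 90, "height": 90, "depth": 90}]},
]


def matches(doc, max_height=70):
    """True if the product is a chair and at least one size is below max_height."""
    return doc["product"] == "chair" and any(
        size["height"] < max_height for size in doc["size"]
    )


# A hit is the whole document, including sizes that do not satisfy the range.
hits = [doc for doc in documents if matches(doc)]
print([doc["_id"] for doc in hits])  # ['0']
print(len(hits[0]["size"]))          # 2
```

Document 0 is returned with both sizes because the 50-unit entry alone is enough for the document to match.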
Even though the writer of the article is knowledgeable, I must say I don't like this kind of article, which tries to draw a direct line from the SQL world to a NoSQL implementation. If it were that easy, some big company would have written an automatic script that converts your SQL schemas to various NoSQL formats. To model your data correctly in NoSQL you must understand your products, the factors that should influence your decisions, the use case and the data. There is no universal solution that tells you: if you did it this way in an RDBMS, do it like this in ES.