bulk update in mongodb with array - database

I want to update a different value for every _id, and I have this array (with 10k records):
[{_id : "11", total : ""}, {_id : "12", total : ""}]
For updateMany a query is required, and for bulkWrite an array of insertOne or updateOne operations is required.
Is there any other way to update the data in the format I have?
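For reference, a minimal sketch of what that bulkWrite can look like, building one updateOne operation per array element (shown here with pymongo; the client, database and collection names are placeholders, and you would fill in your real totals):

from pymongo import MongoClient, UpdateOne

coll = MongoClient()["mydb"]["mycoll"]  # hypothetical connection and collection

records = [{"_id": "11", "total": ""}, {"_id": "12", "total": ""}]  # your 10k-record array

# One updateOne per record, all sent to the server in a single bulkWrite call.
ops = [UpdateOne({"_id": r["_id"]}, {"$set": {"total": r["total"]}}) for r in records]
result = coll.bulk_write(ops, ordered=False)
print(result.modified_count)

With unordered operations the server is free to process the batch in any order, and a failure on one record does not stop the remaining updates.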

Related

Performance issue when querying time-based objects

I'm currently working on a MongoDB collection containing documents that look like the following:
{ startTime: Date, endTime: Date, source: String, metaData: {} }
My use case is to retrieve all documents whose time range overlaps a queried time frame, so my query looks like this:
db.myCollection.find(
    {
        $and: [
            { "source": aSource },
            { "startTime": { $lte: timeFrame.end } },
            { "endTime": { $gte: timeFrame.start } }
        ]
    }
).sort({ "startTime": 1 })
With an index defined as follows:
db.myCollection.createIndex( { "source" : 1, "startTime": 1, "endTime": 1 } );
The problem is that queries are very slow (several hundred milliseconds on a local database) as soon as the number of documents per source increases.
Running the query with explain shows that the index is used efficiently (only matching documents are fetched, everything else is index-only access), so the slowness seems to come from the index scan itself, since the query has to walk a large portion of the index.
On top of that, such an index grows huge pretty quickly and therefore seems inefficient.
Is there anything I'm missing that could make these queries faster, or is retrieving all the documents belonging to a given source the best I can do? I see that MongoDB now provides some time-series features; could they help with my problem?

JSON data Aggregation in flink 1.10 DataStream API

I am trying to aggregate data into Elasticsearch from Kafka messages (read as a Flink 1.10 StreamSource). The data arrives in JSON format, which is dynamic; a sample is given below. I want to combine multiple records into a single document per unique ID. The data arrives in sequence and is time-series data.
The source is Kafka and the destination sink is Elasticsearch 7.6.1.
I have not found any good example that can be applied to the problem statement below.
Record 1:
{
    "ID" : "1",
    "timestamp" : "2020-05-07 14:34:51.325",
    "Data" : {
        "Field1" : "ABC",
        "Field2" : "DEF"
    }
}
Record 2:
{
    "ID" : "1",
    "timestamp" : "2020-05-07 14:34:51.725",
    "Data" : {
        "Field3" : "GHY"
    }
}
Result:
{
    "ID" : "1",
    "Start_timestamp" : "2020-05-07 14:34:51.325",
    "End_timestamp" : "2020-05-07 14:34:51.725",
    "Data" : {
        "Field1" : "ABC",
        "Field2" : "DEF",
        "Field3" : "GHY"
    }
}
Below are the version details:
Flink 1.10
Flink-kafka-connector 2.11
Flink-Elasticsearch-connector 7.x
Kafka 2.11
JDK 1.8
What you're asking for could be described as some sort of join, and there are many ways you might accomplish this with Flink. There's an example of stateful enrichment in the Apache Flink Training that shows how to implement a similar join using a RichFlatMapFunction that should help you get started. You'll want to read through the relevant training materials first -- at least the section on Data Pipelines & ETL.
What you'll end up doing with this approach is to partition the stream by ID (via keyBy), and then use key-partitioned state (probably MapState in this case, assuming you have several attribute/value pairs to store for each ID) to store information from records like record 1 until you're ready to emit a result.
BTW, if the set of keys is unbounded, you'll need to take care that you don't keep this state forever. Either clear the state when it's no longer needed (as this example does), or use State TTL to arrange for its eventual deletion.
For more information on other kinds of joins in Flink, see the links in this answer.

DynamoDB How to design and query multiple fields

I have an item like this:
{
    "date": "2019-10-05",
    "id": "2",
    "serviceId": "1",
    "time": {
        "endTime": "1300",
        "startTime": "1330"
    }
}
Right now, this is the way I have designed it:
primary key --> id
Global secondary index --> primary key : serviceId
--> sort key : date
With the way I have designed it as of now:
* I can query by id
* I can query by serviceId and a range of dates
I'd like to be able to query such that I can retrieve all items where
* serviceId = 1 AND
* date = "yyyy-mm-dd" AND
* time = {
"endTime": "1300",
"startTime": "1330"
}
I'd still like to be able to query based on the two previous conditions (query by id, and query by serviceId and a range of dates).
Is there a way to do this? One way I was thinking of is to create a new field and use it as an index, e.g. combine all the data so that
combinedField: "1_yyyy-mm-dd_1300_1330"
becomes the partition key of a global secondary index, and just query it like that.
I'm just not sure whether this is the way to do it or whether there's a better or best-practice way.
Thank you
You could either use FilterExpression or composite sort keys.
FilterExpression
Here you retrieve the items from the GSI you described by specifying 'serviceId' and 'date' in the key condition, and then filter on time.startTime and time.endTime in the 'FilterExpression'. The sample Python code using boto3 would be as follows:
from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    IndexName='serviceId-date-index',  # replace with the name of your GSI
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date').eq('2019-10-05'),
    FilterExpression=Attr('time.endTime').eq('1300') & Attr('time.startTime').eq('1330')
)
The drawback of this method is that all items matching the key condition will be read, and only then are the results filtered, so you are charged according to what the key condition matches.
For example, if 1000 items have 'serviceId' 1 and 'date' '2019-10-05' but only 10 of them have 'time.startTime' 1330, you will still be charged for reading the 1000 items even though only 10 are returned after the FilterExpression is applied.
Composite Sort Key
I believe this is the method you mentioned in the question. Here you will need to create an attribute of the form
'yyyy-mm-dd_startTime_endTime'
and use this as the sort key in your GSI. Now your items will look like this:
{ "date": "2019-10-05",
"id": "2",
"serviceId": "1",
"time": {
"endTime": "1300",
"startTime": "1330"
}
"date_time":"2019-10-05_1330_1300"
}
Your GSI will have 'serviceId' as the partition key and 'date_time' as the sort key. Now you will be able to query a date range as:
response = table.query(
    IndexName='serviceId-date_time-index',  # replace with the name of your GSI
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').between('2019-07-05', '2019-10-05')
)
For the query where date, start and end time are specified, you can query as:
response = table.query(
    IndexName='serviceId-date_time-index',  # replace with the name of your GSI
    KeyConditionExpression=Key('serviceId').eq('1') & Key('date_time').eq('2019-10-05_1330_1300')
)
This approach won't work if you need a range of dates and the start and end time together, i.e. you won't be able to query for items in a particular date range that have a specific start and end time. In that case you would have to use a FilterExpression.
Yes, the solution you suggested (add a new field which is the combination of the fields and defined a GSI on it) is the standard way to achieve that. You need to make sure that the character you use for concatenation is unique, i.e., it cannot appear in any of the individual fields you combine.
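To illustrate that, a small boto3 sketch (the table name and values are hypothetical) of writing an item together with the concatenated attribute the GSI would index:

import boto3

table = boto3.resource('dynamodb').Table('services')  # hypothetical table name

item = {
    "id": "2",
    "serviceId": "1",
    "date": "2019-10-05",
    "time": {"startTime": "1330", "endTime": "1300"},
}
# Composite GSI sort key; the '_' separator must never occur inside date, startTime or endTime.
item["date_time"] = "_".join([item["date"], item["time"]["startTime"], item["time"]["endTime"]])
table.put_item(Item=item)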

Getting the latest (timestamp wise) value from cloudant query

I have a cloudant DB where each document looks like:
{
    "_id": "2015-11-20_attr_00",
    "key": "attr",
    "value": "00",
    "employeeCount": 12,
    "timestamp": "2015-11-20T18:16:05.366Z",
    "epocTimestampMillis": 1448043365366,
    "docType": "attrCounts"
}
For a given attribute there is an employee count, and as you can see I have a record for the same attribute every day. I am trying to create a view or index that will give me the latest record for each attribute. Meaning, if I inserted a record on 2015-10-30 and another on 2015-11-10, the one returned to me is just the employee count for the record with timestamp 2015-11-10.
I have tried a view, but I get all the entries for each attribute, not just the latest. I did not look at indexes because I thought they do not get precalculated. I will be querying this from the client side, so having it precalculated (like views are) is important.
Any guidance would be most appreciated. Thank you.
I created a test database you can see here. Just make sure that when you insert your JSON documents into Cloudant (or CouchDB), your timestamps are not strings but JavaScript Date objects:
https://examples.cloudant.com/latestdocs/_all_docs?include_docs=true
I built a search index like this (name the design doc "summary" and the search index "latest"):
function (doc) {
    if (doc.docType == "totalEmployeeCounts" && doc.key == "div") {
        index("division", doc.value, {"store": true});
        index("timestamp", doc.timestamp, {"store": true});
    }
}
Then here's a query that will return only the latest record for each division. Note that the limit value will apply to each group, so with limit=1, if there are 4 groups you will get 4 documents not 1.
https://examples.cloudant.com/latestdocs/_design/summary/_search/latest?q=*:*&limit=1&group_field=division&include_docs=true&sort_field=-timestamp
Indexing TimeStamp as a string is not recommended.
Reference:
https://cloudant.com/blog/defensive-coding-in-mapindex-functions/#.VvRVxtIrJaT
I had the same problem. I converted the timestamp value to milliseconds (a number) and then indexed that value:
var millis = Date.parse(doc.timestamp);
index("millis", millis, {"store": false});
You can use the same query as Raj suggested, but with the 'millis' field instead of the timestamp.

Check first value in array, insert new conditionally

I have an array of "states" in my documents:
{
    "_id" : ObjectId("53026de61e30e2525d000004"),
    "states" : [
        {
            "name" : "complete",
            "userId" : ObjectId("52f4576126cd0cbe2f000005"),
            "_id" : ObjectId("53026e16c054fc575d000004")
        },
        {
            "name" : "active",
            "userId" : ObjectId("52f4576126cd0cbe2f000005"),
            "_id" : ObjectId("53026de61e30e2525d000004")
        }
    ]
}
I just insert a new state onto the front of the array when there is a new state. The current workaround until MongoDB 2.6 is released is here: Can you have mongo $push prepend instead of append?
However, I do not want users to be able to save the same state twice in a row, i.e. if it's already 'complete' you should not be able to add another 'complete' state. Is there a way I can check the first element in the array and only insert the new state if it's not the same, in one query/update command to Mongo?
I say one query/update because Mongo does not support transactions, so I don't want to query for the first element in the array and then send another update statement, as that could cause problems if another state got inserted between my query and my update.
You can qualify your update statement with a query, for example:
db.mydb.states.update({"states.name":{$nin:["newstate"]}},{$addToSet:{"states":{"name":"newstate"}}})
This will prevent updates from a user if the query part of the update returns no document. You can additionally add more fields to filter on in the query part.
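For the stricter requirement of checking only the first element, a variant of the same single-update idea is sketched below with pymongo (database, collection and ids are placeholders), assuming MongoDB 2.6+ so that $position can prepend; before 2.6 you would combine the same filter with the prepend workaround linked in the question:

from bson import ObjectId
from pymongo import MongoClient

states_coll = MongoClient()["mydb"]["docs"]  # hypothetical database and collection

doc_id = ObjectId("53026de61e30e2525d000004")  # the document to update
new_state = {
    "name": "complete",
    "userId": ObjectId("52f4576126cd0cbe2f000005"),
    "_id": ObjectId(),
}

# The filter matches only when the first (most recent) state does not already have
# this name; the check and the prepend happen in one atomic update, so there is no
# window for another state to slip in between a read and a write.
result = states_coll.update_one(
    {"_id": doc_id, "states.0.name": {"$ne": new_state["name"]}},
    {"$push": {"states": {"$each": [new_state], "$position": 0}}},
)
# result.modified_count == 0 means the first state already had that name (or no doc matched).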
