Druid - descending timestamps with groupBy query

What I'm asking for should be very simple but the Druid docs have little to no info about this.
I am making a groupBy query, and the data is very large so I'm "paging" it by increasing limitSpec.limit on each subsequent query.
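For context, this is roughly how that paging looks on my side in Python (a sketch only; the broker endpoint, page size and query body are placeholders, not my real setup):

import requests

BROKER_URL = "http://localhost:8082/druid/v2"   # placeholder broker endpoint
PAGE_SIZE = 1000

def fetch_page(query: dict, page: int) -> list:
    # Grow limitSpec.limit on every call and keep only the rows not seen yet.
    query["limitSpec"]["limit"] = PAGE_SIZE * (page + 1)
    rows = requests.post(BROKER_URL, json=query).json()
    return rows[PAGE_SIZE * page:]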
By default, the returned array starts from the beginning timestamp and moves forward in time. I want the results to start from the end timestamp and move backwards in time from there.
Does anyone know how to do that?
So, in other words, by default the groupBy results look like this:
[
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_one>
    }
  },
  {
    "version" : "v1",
    "timestamp" : "2012-01-02T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_two>
    }
  }
]
Whereas I want them to look like this:
[
  {
    "version" : "v1",
    "timestamp" : "2012-01-02T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_two>
    }
  },
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_one>
    }
  }
]

You can achieve the ordering by using the "columns" attribute in the limitSpec. See the example below.
{
  "type" : "default",
  "limit" : <integer_value>,
  "columns" : [list of OrderByColumnSpec],
}
For more details, refer to the Druid doc on limitSpec:
http://druid.io/docs/latest/querying/limitspec.html
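For instance, a descending sort is expressed with an OrderByColumnSpec entry in that list. As a rough sketch in Python dict form (the dimension name here is just a placeholder):

# Sketch of a limitSpec that sorts descending; "some_dimension" is a placeholder.
limit_spec = {
    "type": "default",
    "limit": 1000,
    "columns": [
        # each entry in "columns" is an OrderByColumnSpec
        {
            "dimension": "some_dimension",
            "direction": "descending",
            "dimensionOrder": "lexicographic",
        }
    ],
}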

You can add the timestamp as a dimension, truncated to the date (assuming you use day granularity in your query), and force Druid to sort the results first by dimension values and then by timestamp (via the sortByDimsFirst context flag), with the timestamp dimension ordered descending in the limitSpec.
Example Query:
{
  "dataSource": "your_datasource",
  "queryType": "groupBy",
  "dimensions": [
    {
      "type": "default",
      "dimension": "some_dimension_in",
      "outputName": "some_dimension_out",
      "outputType": "STRING"
    },
    {
      "type": "extraction",
      "dimension": "__time",
      "outputName": "__timestamp",
      "extractionFn": {
        "type": "timeFormat",
        "format": "yyyy-MM-dd"
      }
    }
  ],
  "aggregations": [
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric_field"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 1000,
    "columns": [
      {
        "dimension": "__timestamp",
        "direction": "descending",
        "dimensionOrder": "numeric"
      },
      {
        "dimension": "some_metric",
        "direction": "descending",
        "dimensionOrder": "numeric"
      }
    ]
  },
  "intervals": [
    "2019-09-01/2019-10-01"
  ],
  "granularity": "day",
  "context": {
    "sortByDimsFirst": "true"
  }
}

Related

MongoDB 4.4: input to $filter must be an array not long

I've implemented the following aggregation in MongoDB 4.4.2 and it works fine:
[{
  $match: {
    "$expr": {
      "$and": [{
        "$not": "$history_to"
      }],
    },
  }
}, {
  $unwind: {
    "path": "$used",
    "preserveNullAndEmptyArrays": true,
  }
}, {
  $project: {
    "ticket": "$$ROOT",
    "status": {
      "$map": {
        "input": {
          "$filter": {
            "input": "$status",
            "as": "cstatus",
            "cond": {
              "$not": "$$cstatus.history_to",
            },
          },
        },
        "as": "status",
        "in": "$$status.value",
      },
    },
  }
}]
But when I try it in MongoDB 4.4.4 I encounter "input to $filter must be an array not long".
If it's any help, I figured out that the cause of this error has something to do with:
    "cond": {
      "$not": "$$cstatus.history_to",
    },
When I comment out $not it works fine. At first I thought that maybe $not was no longer supported, but it is, so I'm out of ideas.
Some example documents:
{
  "_id" : ObjectId("6083ca1ce151beea45602e5d"),
  "created_at" : ISODate("2021-04-24T07:34:52.947Z"),
  "ticket_id" : ObjectId("6083ca1c68badcedddd875ba"),
  "expire_at" : ISODate("2021-04-24T19:30:00Z"),
  "history_from" : ISODate("2021-04-24T07:34:52.992Z"),
  "created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "serial" : "116627138",
  "status" : [
    {
      "history_from" : ISODate("2021-04-24T07:34:52.985Z"),
      "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
      "value" : NumberLong(1)
    }
  ],
  "history_created_by" : ObjectId("604df7857d58ab06685ed02e")
},
{
  "_id" : ObjectId("60713b0fe151beea45602e56"),
  "created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "ticket_id" : ObjectId("60713b0f68badcedddd875b8"),
  "created_at" : ISODate("2021-04-10T05:43:43.228Z"),
  "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "status" : [
    {
      "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
      "value" : NumberLong(1),
      "history_from" : ISODate("2021-04-10T05:43:43.277Z")
    }
  ],
  "serial" : "538142578",
  "expire_at" : ISODate("2021-04-10T19:30:00Z"),
  "history_from" : ISODate("2021-04-10T05:43:43.281Z")
}
Your query looks fine for the example documents you posted; you can check it in a playground.
As for "But when I try it in MongoDB 4.4.4, I encounter input to $filter must be an array not long": this is not a MongoDB-version-specific issue. The error says that the status field provided as input to $filter is not an array but a long, and $filter's input must be an array. There are definitely some document(s) with a non-array value in the status field.
To check for (and skip) them, you can add a $match stage before the $filter operation:
{ $match: { status: { $type: "array" } } } // "array" is BSON type 4
This restricts the pipeline to documents whose status field is an array.
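If you drive the pipeline from Python, a minimal pymongo sketch of adding that guard could look like this (the connection string, database and collection names are hypothetical, and the $project stage is shortened from your pipeline):

from pymongo import MongoClient

# Hypothetical connection and names; adjust to your deployment.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["tickets"]

pipeline = [
    # Guard: only documents whose status is an array reach $filter.
    {"$match": {"status": {"$type": "array"}}},
    {"$project": {
        "status": {
            "$map": {
                "input": {
                    "$filter": {
                        "input": "$status",
                        "as": "cstatus",
                        "cond": {"$not": "$$cstatus.history_to"},
                    }
                },
                "as": "status",
                "in": "$$status.value",
            }
        }
    }},
]

for doc in coll.aggregate(pipeline):
    print(doc)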

Looking up index in a MongoDB array

Our data provider supplies the data in a weird format. The arrays date and value are corresponding and guaranteed to have the same length. For whatever reason, they even decide to mix up int and string values in date.
[
{
"_id": "A000005933",
"date": [905270400000, 918748800000, 937843200000, 965923200000, 983289600000, 984931200000, 1152806400000, "1171987200000", "1225382400000", "1229616000000", "1286208000000", "1455552000000"],
"value": ["0.25", "0.15", "0", "0.25", "0.15", "0", "0.25", "0.5", "0.3", "0.1", "0.1", "-0.1"],
"version": 1614837436798
},
{
"_id": "A000005934",
"date": [915120000000, 923587200000, 941731200000, 949593600000, 953222400000, 956851200000, 962121600000, 967737600000, 970761600000, 989510400000, 999187200000, 1000742400000, 1005235200000, 1039104000000, 1046966400000, 1054828800000, 1133798400000, 1141747200000, 1150300800000, 1155052800000, 1160496000000, 1165939200000, 1173801600000, 1181664000000, 1215532800000, 1224000000000, 1226419200000, 1228838400000, 1232467200000, 1236700800000, 1239120000000, 1242144000000, 1302624000000, 1310486400000, 1320768000000, 1323792000000, 1341936000000, 1367942400000, 1384272000000, 1402416000000, 1410278400000, 1458057600000],
"value": ["3", "2.5", "3", "3.25", "3.5", "3.78", "4.25", "4.5", "4.78", "4.5", "4.25", "3.75", "3.25", "2.78", "2.5", "2", "2.25", "2.5", "2.75", "3", "3.25", "3.5", "3.75", "4", "4.25", "3.75", "3.25", "2.5", "2", "1.5", "1.25", "1", "1.25", "1.5", "1.25", "1", "0.75", "0.5", "0.25", "0.15", "0.05", "0"],
"version": 1614837436548
},
......
]
Our typical use case is to look up value based on _id and date, so I had to do something like this.
def get_value_from_mongo(id_: str, date: datetime.date) -> float:
    result = db.indicators.find_one({"_id": id_}, {"value": 1, "date": 1})
    date_list = list(map(str, result["date"]))
    price_list = list(map(str, result["value"]))
    dt = date.strftime("%s000")
    price = float(price_list[date_list.index(dt)])
    return price
This is hopelessly inefficient because the whole array is scanned each time I want to retrieve a single value. Maybe I could do a binary search, but date is not guaranteed to be sorted and I don't want to rely on that behavior.
Are there any MongoDB operators I can use to speed up the query?
A first possibility is to focus on the lookup: create an index on the dates array.
This comes at the cost of slower writes.
In the execution plan below you can see the index is used (you should benchmark whether it brings much of an improvement).
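For instance, with pymongo that index could be created like this (a sketch; the connection details are placeholders and the field is named dates as in the demo data below):

from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["dummy"]["indicators"]  # placeholder connection

# Indexing an array field automatically makes the index multikey.
coll.create_index([("dates", ASCENDING)], name="dates_1")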
> db.indicators.explain().find({dates: '1.1'})
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "dummy.indicators",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "dates" : {
        "$eq" : "1.1"
      }
    },
    "queryHash" : "4204704C",
    "planCacheKey" : "1DBFE945",
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN", // <------
        "keyPattern" : {
          "dates" : 1
        },
        "indexName" : "dates_1",
        "isMultiKey" : true,
        "multiKeyPaths" : {
          "dates" : [
            "dates"
          ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "dates" : [
            "[\"1.1\", \"1.1\"]"
A second possibility is to focus on retrieving the minimal amount of data, on the hint that the bottleneck is not the date lookup but the data transfer.
This does not improve the lookup itself (the array is still "iterated", just on the db side rather than in application code).
You can do it with:
the positional operator
the projection as second argument of find (with mongo >= 4.4)
db.indicators.remove({})
db.indicators.insert([{ _id: '0', dates: [1, '1.1', 2], prices: [1, 2, 3] }])
fetch = date => {
  print(date)
  res = db.indicators.find(
    {
      dates: {
        $elemMatch: {
          $in: [Number(date), String(date)]
        }
      }
    },
    {
      'prices.$': 1 // <<--------
    }
  ).toArray()
  printjson(res)
}
fetch(2)     // [ { "_id" : "0", "prices" : [ 3 ] } ]
fetch('1.1') // [ { "_id" : "0", "prices" : [ 2 ] } ]
Obviously you can combine 1 and 2, but I would try just 2 first, to avoid creating an index.
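Adapted to the Python helper from the question, option 2 might look like this (a sketch; connection details are placeholders, and the field names date/value follow the question rather than the dates/prices demo above):

import datetime

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]   # placeholder connection

def get_value_from_mongo(id_: str, date: datetime.date) -> float:
    ts = date.strftime("%s000")   # epoch millis as a string, as in the question
    doc = db.indicators.find_one(
        # Match the timestamp whether it was stored as an int or a string.
        {"_id": id_, "date": {"$elemMatch": {"$in": [int(ts), ts]}}},
        # Positional projection keeps only the value entry at the matched index,
        # as in the 'prices.$' shell example above.
        {"value.$": 1},
    )
    return float(doc["value"][0])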

Is it possible to apply a solr document int field value as boost value if a specific field is matched?

Ex.
"docs": [
{
"id": "f37914",
"index_id": "some_index",
"field_1": [
{
"Some value",
"boost": 20.
}
]
},
]
If 'field_1' is matched, then boost by corresponding 'boost' field.
Boost what? The document? The specific field? You can do any of them.
Either way, the way to do it is to use function queries:
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
For example, if you want to boost the document (and assuming the score should be 0 when the value doesn't match), you can do something like this:
q=_val_:"if(query($q1), field(boost), 0)"&q1=field_1:"Some Value"
_val_ is just a hook into Solr's function queries: query returns true if q1 matches, field simply returns the value of the field itself, and if lets us join the two together.
What I ended up doing is using Lucene payloads and Solr 6.6's new DelimitedPayloadTokenFilter feature.
First I created a terms field type and field with the following configuration:
{
  "add-field-type": {
    "name": "terms",
    "stored": "true",
    "class": "solr.TextField",
    "positionIncrementGap": "100",
    "indexAnalyzer": {
      "tokenizer": {
        "class": "solr.KeywordTokenizerFactory"
      },
      "filters": [
        {
          "class": "solr.LowerCaseFilterFactory"
        },
        {
          "class": "solr.DelimitedPayloadTokenFilterFactory",
          "encoder": "float",
          "delimiter": "|"
        }
      ]
    },
    "queryAnalyzer": {
      "tokenizer": {
        "class": "solr.KeywordTokenizerFactory"
      },
      "filters": [
        {
          "class": "solr.LowerCaseFilterFactory"
        },
        {
          "class": "solr.SynonymGraphFilterFactory",
          "ignoreCase": "true",
          "expand": "false",
          "tokenizerFactory": "solr.KeywordTokenizerFactory",
          "synonyms": "synonyms.txt"
        }
      ]
    }
  },
  "add-field": {
    "name": "terms",
    "type": "terms",
    "stored": "true",
    "multiValued": "true"
  }
}
I indexed my documents like so:
[
  {
    "id" : "1",
    "terms" : [
      "some term|10.0",
      "another term|60.0"
    ]
  },
  {
    "id" : "2",
    "terms" : [
      "some term|11.0",
      "another term|21.0"
    ]
  }
]
I used Solr's function query support to query for a match on terms, grab the attached boost payload, and apply it to the relevance score:
/solr/payloads/select?indent=on&wt=json&q={!payload_score%20f=ai_terms_wtih_synm_3%20v=$payload_term%20func=max}&fl=id,score&payload_term=some+term
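For completeness, the same kind of request issued from Python with requests (a sketch only; the Solr host and the payloads core name are from my local setup, and the field here is the terms field defined above):

import requests

params = {
    "q": "{!payload_score f=terms v=$payload_term func=max}",
    "payload_term": "some term",
    "fl": "id,score",
    "wt": "json",
    "indent": "on",
}
# Placeholder host/core; adjust to your Solr deployment.
resp = requests.get("http://localhost:8983/solr/payloads/select", params=params)
print(resp.json()["response"]["docs"])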

ElasticSearch : Combine fuzzy and phrase query results

I am building a search engine for a movie database with ElasticSearch.
My need is to combine a fuzzy query with a phrase query to get the benefits of both:
Fuzzy: tolerates misspellings
Phrase: gives importance to the position of the words
I am also using the standard tokenizer and some filters: asciifolding, elision, lowercase, worddelimiter and a custom stemmer (only for plurals).
This is my query:
POST _search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Title": {
              "query": "scoobydoo",
              "type": "phrase",
              "slop": 50
            }
          }
        },
        {
          "match": {
            "Title": {
              "query": "scoobydoo",
              "fuzziness": "1"
            }
          }
        }
      ]
    }
  }
}
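In case it helps, this is roughly how I send it from Python (just a sketch; the host and index name are from my local setup):

import requests

# Same bool/should body as above, posted to the index's _search endpoint.
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"Title": {"query": "scoobydoo", "type": "phrase", "slop": 50}}},
                {"match": {"Title": {"query": "scoobydoo", "fuzziness": "1"}}},
            ]
        }
    }
}
resp = requests.post("http://localhost:9200/canal/_search", json=query)
print(resp.json()["hits"]["hits"])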
This is my configuration:
PUT /canal/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding", "elision", "lowercase", "worddelimiter", "my_stemmer"]
        }
      },
      "filter": {
        "elision": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d"]
        },
        "worddelimiter": {
          "type": "word_delimiter"
        },
        "my_stemmer": {
          "type": "stemmer",
          "name": ["minimal_french", "minimal_english"]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "properties": {
        "Title": {
          "type": "string",
          "analyzer": "custom_analyzer",
          "search_analyzer": "custom_analyzer"
        }
      }
    }
  }
}
I have to admit that I am not very confident about how I build my query with the "should" and the two sub-queries. So my question is: is mine the correct way of doing the query, or could I do better?
Thanks a lot!
Florent

Conditionally remove Subdocument nested inside array of document MongoDB

I have a collection with documents like this:
{
  "_id" : "ABC",
  "Name" : "Rajesh",
  "createstmp" : ISODate("2015-06-22T17:09:16.705Z"),
  "updstmp" : ISODate("2015-06-22T19:31:53.527Z"),
  "AvgValue" : "65",
  "PreValues" : [
    {
      "Date" : 20150709,
      "Rate" : [
        {
          "Time" : 1566,
          "value" : 60
        },
        {
          "Time" : 1500,
          "value" : 400
        },
        {
          "Time" : 1400,
          "value" : 100
        },
        {
          "Time" : 1500,
          "value" : 103
        }
      ]
    }
  ]
}
I want to remove the duplicate entry for a particular Date value.
E.g. if the Time value is 1500, I need to pull that element and push the new value for it, in a single bulk operation.
Here is my query:
bulk.find({ "_id":"ABC" })
.update(
{
"_id": "ABC",
"PreValues": { "Date": 20150709 }
},
{
$pu‌​ll: { "PreValues": { "Rate": { "Time": 1000 } } }
}
);
bulk.find({ "_id":"ABC" })
.update(
{ "_id": "ABC","PreValues": { "Date": 20150709 }},
{ $pu‌​sh : {
"PreValues": { "Rate": { "Time": 1000,"Rating": 100 }}
}}
);
bulk.execute();
It's not a great idea to have nested arrays since the only thing you will ever be able to do atomically is $push or $pull. See the positional $ operator for details on why "nested arrays" are not good here, but basically you can only ever match the position of the "outer" array element.
And that is basically what you are missing here, and of course the proper "dot notation" for accessing the elements:
var bulk = db.ABA.initializeOrderedBulkOp();
bulk.find({ "_id": "ABC", "PreValues.Date": 20150709 })
.updateOne({ "$pull": { "PreValues.$.Rate": { "Time": 1500 } } })
bulk.find({ "_id": "ABC", "PreValues.Date": 20150709 })
.updateOne({ "$push": { "PreValues.$.Rate": { "Time": 1500, "Rating": 100 } } })
bulk.execute();
Which alters the document like so:
{
  "_id" : "ABC",
  "Name" : "Rajesh",
  "createstmp" : ISODate("2015-06-22T17:09:16.705Z"),
  "updstmp" : ISODate("2015-06-22T19:31:53.527Z"),
  "AvgValue" : "65",
  "PreValues" : [
    {
      "Date" : 20150709,
      "Rate" : [
        {
          "Time" : 1566,
          "value" : 60
        },
        {
          "Time" : 1400,
          "value" : 100
        },
        {
          "Time" : 1500,
          "Rating" : 100
        }
      ]
    }
  ]
}
That is the correct syntax for both statements; it sends both requests to the server at the same time with a single response.
Note that you need to include in the .find() query a field from the outer array to match. This is so the positional $ operator is populated with the matched index of that element and knows which array element to act upon.
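If you are driving this from Python, the same pair of updates can go out in one round trip with pymongo's bulk_write (a sketch; the database and collection names are placeholders):

from pymongo import MongoClient, UpdateOne

coll = MongoClient("mongodb://localhost:27017")["test"]["ABA"]   # placeholder connection/names

match = {"_id": "ABC", "PreValues.Date": 20150709}
coll.bulk_write([
    # Positional $ resolves to the PreValues element whose Date matched above.
    UpdateOne(match, {"$pull": {"PreValues.$.Rate": {"Time": 1500}}}),
    UpdateOne(match, {"$push": {"PreValues.$.Rate": {"Time": 1500, "Rating": 100}}}),
], ordered=True)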
