ElasticSearch: Combine fuzzy and phrase query results

I am building a search engine for a movie database with ElasticSearch.
My need is to combine a fuzzy query with a phrase query to get the benefits of both:
Fuzzy: tolerates misspellings
Phrase: gives importance to the position of the words
I am also using the default tokeniser and some filters: asciifolding, elision, lowercase, worddelimiter, and a custom stemmer (only for plurals).
This is my query:
POST _search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Title": {
              "query": "scoobydoo",
              "type": "phrase",
              "slop": 50
            }
          }
        },
        {
          "match": {
            "Title": {
              "query": "scoobydoo",
              "fuzziness": "1"
            }
          }
        }
      ]
    }
  }
}
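(For reference: on Elasticsearch 5.x and later, the phrase-typed match is usually written as a dedicated match_phrase query. A minimal sketch of the same bool/should combination in that syntax; the "AUTO" fuzziness is an illustrative choice, not from the original query:)
POST _search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "Title": { "query": "scoobydoo", "slop": 50 } } },
        { "match": { "Title": { "query": "scoobydoo", "fuzziness": "AUTO" } } }
      ]
    }
  }
}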
This is my configuration:
PUT /canal/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding", "elision", "lowercase", "worddelimiter", "my_stemmer"]
        }
      },
      "filter": {
        "elision": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d"]
        },
        "worddelimiter": {
          "type": "word_delimiter"
        },
        "my_stemmer": {
          "type": "stemmer",
          "name": ["minimal_french", "minimal_english"]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "properties": {
        "Title": {
          "type": "string",
          "analyzer": "custom_analyzer",
          "search_analyzer": "custom_analyzer"
        }
      }
    }
  }
}
I have to admit that I am not very confident about how I built the query with the "should" and the two clauses. So my question is: is mine the correct way of doing this query, or could I do better?
Thanks a lot!
Florent

Related

How to use mongodb projection query in array elements

I would like to project only "type": "A" in menu. Is there a way to achieve this?
"_id" : "1",
"menu" : [
{
"type" : "A",
"items" : [
{
"key" : "Add",
"enabled" : true,
} ]
},
{
"type" : "B",
"items" : [
{
"key" : "Add",
"enabled" : true,
} ]
}
]
}
If you want to output only the menu entries whose type is "A", you can use $elemMatch in a find query (take care, because this only returns the first matched element):
db.collection.find({},
{
  "menu": {
    "$elemMatch": {
      "type": "A"
    }
  }
})
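An equivalent projection using the positional $ operator also returns only the first matching element, but it requires the predicate in the query part (a sketch):
db.collection.find(
  { "menu.type": "A" },
  { "menu.$": 1 }
)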
Or $filter in an aggregate query (this returns all objects with type: "A" in the array):
db.collection.aggregate([
  {
    "$project": {
      "menu": {
        "$filter": {
          "input": "$menu",
          "cond": {
            "$eq": [
              "$$this.type",
              "A"
            ]
          }
        }
      }
    }
  }
])
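For the sample document above, this aggregation returns the menu filtered down to the single type "A" entry (the _id is kept by default):
{
  "_id": "1",
  "menu": [
    {
      "type": "A",
      "items": [
        {
          "key": "Add",
          "enabled": true
        }
      ]
    }
  ]
}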

MongoDB 4.4: input to $filter must be an array not long

I've implemented the following aggregation in MongoDB 4.4.2 and it works fine:
[{
  $match: {
    "$expr": {
      "$and": [{
        "$not": "$history_to"
      }]
    }
  }
}, {
  $unwind: {
    "path": "$used",
    "preserveNullAndEmptyArrays": true
  }
}, {
  $project: {
    "ticket": "$$ROOT",
    "status": {
      "$map": {
        "input": {
          "$filter": {
            "input": "$status",
            "as": "cstatus",
            "cond": {
              "$not": "$$cstatus.history_to"
            }
          }
        },
        "as": "status",
        "in": "$$status.value"
      }
    }
  }
}]
But when I try it in MongoDB 4.4.4 I encounter "input to $filter must be an array not long".
If it's any help, I figured out that the cause of this error has something to do with:
"cond": {
"$not": "$$cstatus.history_to",
},
because when I comment out the $not it works fine. At first I thought that maybe $not was not supported anymore, but it is, so I'm out of ideas.
Some example documents:
{
  "_id" : ObjectId("6083ca1ce151beea45602e5d"),
  "created_at" : ISODate("2021-04-24T07:34:52.947Z"),
  "ticket_id" : ObjectId("6083ca1c68badcedddd875ba"),
  "expire_at" : ISODate("2021-04-24T19:30:00Z"),
  "history_from" : ISODate("2021-04-24T07:34:52.992Z"),
  "created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "serial" : "116627138",
  "status" : [
    {
      "history_from" : ISODate("2021-04-24T07:34:52.985Z"),
      "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
      "value" : NumberLong(1)
    }
  ],
  "history_created_by" : ObjectId("604df7857d58ab06685ed02e")
},
{
  "_id" : ObjectId("60713b0fe151beea45602e56"),
  "created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "ticket_id" : ObjectId("60713b0f68badcedddd875b8"),
  "created_at" : ISODate("2021-04-10T05:43:43.228Z"),
  "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
  "status" : [
    {
      "history_created_by" : ObjectId("604df7857d58ab06685ed02e"),
      "value" : NumberLong(1),
      "history_from" : ISODate("2021-04-10T05:43:43.277Z")
    }
  ],
  "serial" : "538142578",
  "expire_at" : ISODate("2021-04-10T19:30:00Z"),
  "history_from" : ISODate("2021-04-10T05:43:43.281Z")
}
Your query looks good for your example documents; you can check it on the playground.
"But when I try it in MongoDB 4.4.4, I encounter input to $filter must be an array not long."
This is not a MongoDB version-specific issue. The error says that the status field provided as input to $filter is not of array type but of long type, and the input to $filter must be an array.
There are definitely some document(s) having a non-array value in the status field.
If you want to guard against them, you can add a match condition before the $filter operation:
{ $match: { status: { $type: "array" } } } // 4 = "array"
This will filter the documents by status, keeping only those where it is an array type.
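To locate the offending documents directly, a query along these lines should work ($not accepts an operator expression; the collection name is a placeholder):
// Documents whose "status" exists but is not an array
db.tickets.find({ "status": { "$exists": true, "$not": { "$type": "array" } } })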

Elasticsearch: how to use exact search and ignore special characters in the keyword?

I have some ID values (numeric and text combinations) in my Elasticsearch index, and in my program users might input special characters in the search keyword.
I want to know whether there is any way to make Elasticsearch do an exact search while removing some special characters from the search keyword.
I already use a custom analyzer to split the search keyword on some special characters, and use a match query to search the data, but I still get no results.
Data:
{
  "_index": "testdata",
  "_type": "_doc",
  "_id": "11112222",
  "_source": {
    "testid": "1MK444750"
  }
}
Custom analyzer:
"analysis" : {
  "analyzer" : {
    "testidanalyzer" : {
      "pattern" : """([^\w\d]+|_)""",
      "type" : "pattern"
    }
  }
}
Mapping:
{
  "article" : {
    "mappings" : {
      "_doc" : {
        "properties" : {
          "testid" : {
            "type" : "text",
            "analyzer" : "testidanalyzer"
          }
        }
      }
    }
  }
}
Here's my Elasticsearch query:
GET /testdata/_search
{
  "query": {
    "match": {
      // "testid": "1MK_444-750" // no result
      "testid": "1MK444750"
    }
  }
}
The analyzer successfully separated my keyword, but I just can't match anything in the results:
POST /testdata/_analyze
{
  "analyzer": "testidanalyzer",
  "text": "1MK_444-750"
}
{
  "tokens" : [
    {
      "token" : "1mk",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "444",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "750",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    }
  ]
}
Please help, thanks in advance!
First off, you should probably model the testid field as keyword rather than text; it's a more appropriate data type.
You want to put in a feature whereby some characters (_, -) are effectively ignored at search time. You can achieve this by giving your field a normalizer, which tells Elasticsearch how to preprocess data for this field prior to indexing or searching. Specifically, you can declare a mapping char filter in your normalizer that replaces these characters with an empty string.
This is how all these changes would fit into your mapping:
PUT /testdata
{
  "settings": {
    "analysis": {
      "char_filter": {
        "mycharfilter": {
          "type": "mapping",
          "mappings": [
            "_ => ",
            "- => "
          ]
        }
      },
      "normalizer": {
        "mynormalizer": {
          "type": "custom",
          "char_filter": [
            "mycharfilter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "testid" : {
          "type" : "keyword",
          "normalizer" : "mynormalizer"
        }
      }
    }
  }
}
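To verify what the normalizer produces, the _analyze API also accepts a normalizer parameter (a quick check; note that since the normalizer has no lowercase filter, case is preserved):
GET /testdata/_analyze
{
  "normalizer": "mynormalizer",
  "text": "1MK_444-750"
}
This should return the single token 1MK444750, which matches the indexed keyword value.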
The following searches would then produce the same results:
GET /testdata/_search
{
  "query": {
    "match": {
      "testid": "1MK444750"
    }
  }
}
GET /testdata/_search
{
  "query": {
    "match": {
      "testid": "1MK_444-750"
    }
  }
}

Druid - descending timestamps with groupBy query

What I'm asking for should be very simple but the Druid docs have little to no info about this.
I am making a groupBy query, and the data is very large, so I'm "paging" it by increasing limitSpec.limit on each subsequent query.
By default, the returned array starts from the beginning timestamp and moves forward in time. I want the results to start from the end timestamp and move backwards in time from there.
Does anyone know how to do that?
So in other words, by default a groupBy query would look like this:
[
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_one>
    }
  },
  {
    "version" : "v1",
    "timestamp" : "2012-01-02T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_two>
    }
  }
]
Whereas I want it to look like this:
[
  {
    "version" : "v1",
    "timestamp" : "2012-01-02T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_two>
    }
  },
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "total_usage" : <some_value_one>
    }
  }
]
You can achieve the ordering by using the "columns" attribute in the limit spec. See the example below:
{
  "type" : "default",
  "limit" : <integer_value>,
  "columns" : [list of OrderByColumnSpec]
}
For more details, refer to the Druid doc below:
http://druid.io/docs/latest/querying/limitspec.html
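Each OrderByColumnSpec entry is an object of this shape (a sketch, ordering by the total_usage metric from the question):
{
  "dimension" : "total_usage",
  "direction" : "descending",
  "dimensionOrder" : "numeric"
}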
You can add the timestamp as a dimension, truncated to date (assuming you use day granularity in your query), and force Druid to sort the results first by dimension values and then by timestamp.
Example Query:
{
  "dataSource": "your_datasource",
  "queryType": "groupBy",
  "dimensions": [
    {
      "type": "default",
      "dimension": "some_dimension_in",
      "outputName": "some_dimension_out",
      "outputType": "STRING"
    },
    {
      "type": "extraction",
      "dimension": "__time",
      "outputName": "__timestamp",
      "extractionFn": {
        "type": "timeFormat",
        "format": "yyyy-MM-dd"
      }
    }
  ],
  "aggregations": [
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric_field"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 1000,
    "columns": [
      {
        "dimension": "__timestamp",
        "direction": "descending",
        "dimensionOrder": "numeric"
      },
      {
        "dimension": "some_metric",
        "direction": "descending",
        "dimensionOrder": "numeric"
      }
    ]
  },
  "intervals": [
    "2019-09-01/2019-10-01"
  ],
  "granularity": "day",
  "context": {
    "sortByDimsFirst": "true"
  }
}

Conditionally remove a subdocument nested inside an array of a MongoDB document

I have a collection with a document like this:
{
  "_id" : "ABC",
  "Name" : "Rajesh",
  "createstmp" : ISODate("2015-06-22T17:09:16.705Z"),
  "updstmp" : ISODate("2015-06-22T19:31:53.527Z"),
  "AvgValue" : "65",
  "PreValues" : [
    {
      "Date" : 20150709,
      "Rate" : [
        {
          "Time" : 1566,
          "value" : 60
        },
        {
          "Time" : 1500,
          "value" : 400
        },
        {
          "Time" : 1400,
          "value" : 100
        },
        {
          "Time" : 1500,
          "value" : 103
        }
      ]
    }
  ]
}
I want to remove the duplicate entry for a particular Date value.
E.g., if the Time value is 1500, I need to pull that element and push the new value in a single bulk operation.
Here is my query
bulk.find({ "_id":"ABC" })
.update(
{
"_id": "ABC",
"PreValues": { "Date": 20150709 }
},
{
$pu‌​ll: { "PreValues": { "Rate": { "Time": 1000 } } }
}
);
bulk.find({ "_id":"ABC" })
.update(
{ "_id": "ABC","PreValues": { "Date": 20150709 }},
{ $pu‌​sh : {
"PreValues": { "Rate": { "Time": 1000,"Rating": 100 }}
}}
);
bulk.execute();
It's not a great idea to have nested arrays since the only thing you will ever be able to do atomically is $push or $pull. See the positional $ operator for details on why "nested arrays" are not good here, but basically you can only ever match the position of the "outer" array element.
And that is basically what you are missing here, and of course the proper "dot notation" for accessing the elements:
var bulk = db.ABA.initializeOrderedBulkOp();
bulk.find({ "_id": "ABC", "PreValues.Date": 20150709 })
.updateOne({ "$pull": { "PreValues.$.Rate": { "Time": 1500 } } })
bulk.find({ "_id": "ABC", "PreValues.Date": 20150709 })
.updateOne({ "$push": { "PreValues.$.Rate": { "Time": 1500, "Rating": 100 } } })
bulk.execute();
Which alters the document like so:
{
  "_id" : "ABC",
  "Name" : "Rajesh",
  "createstmp" : ISODate("2015-06-22T17:09:16.705Z"),
  "updstmp" : ISODate("2015-06-22T19:31:53.527Z"),
  "AvgValue" : "65",
  "PreValues" : [
    {
      "Date" : 20150709,
      "Rate" : [
        {
          "Time" : 1566,
          "value" : 60
        },
        {
          "Time" : 1400,
          "value" : 100
        },
        {
          "Time" : 1500,
          "Rating" : 100
        }
      ]
    }
  ]
}
That is the correct syntax for both statements, and it sends both requests to the server at the same time with a single response.
Note that you need to include in the .find() query a field from the outer array to match. This is so the positional $ operator is populated with the matched index of that element, and the operator knows which array element to act upon.
