Multiple match_phrase conditions with another bool in a single ElasticSearch query?

I am trying to run an Elasticsearch query that searches a text field ("body") and returns items that match at least one of two multi-word phrases I provide (e.g. "stack overflow" OR "the stackoverflow"). I would also like the query to only return results that occur after a given timestamp, with the results ordered by time.
My current solution is below. I believe the MUST is working correctly (gte a timestamp), but the BOOL + SHOULD with two match_phrases is not correct. I am getting the following error:
Unexpected character ('{' (code 123)): was expecting double-quote to start field name
Which I think is because I have two match_phrases in there?
This is the ES mapping, and the details of the ES API I am using are here.
{"query":
{"bool":
{"should":
[{"match_phrase":
{"body":"a+phrase"}
},
{"match_phrase":
{"body":"another+phrase"}
}
]
},
{"bool":
{"must":
[{"range":
{"created_at:
{"gte":"thispage"}
}
}
]}
}
},"size":10000,
"sort":"created_at"
}

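I think you were missing a closing " after created_at. In addition, the second bool object sits directly inside the query object without a field name, which is what the "was expecting double-quote to start field name" error points at. Nesting the should clauses in their own bool inside must fixes both: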
{
"query": {
"bool": {
"must": [
{
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
]
}
}
]
}
},
"size": 10,
"sort": "created_at"
}
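Also, you are allowed to have both must and should as properties of a single bool object, so this is also worth trying. Note that once must is present, the should clauses become optional by default, so set minimum_should_match to 1 to still require at least one of the phrases.
{
"query": {
"bool": {
"must": {
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
],
"minimum_should_match": 1
}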
},
"size": 10,
"sort": "created_at"
}
On a side note, Postman or any JSON formatter/validator would really help in determining where the error is.
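One way to avoid hand-written brace and quote mistakes altogether is to build the request body as a data structure and let a JSON serializer emit it. The following is only a minimal Python sketch under that assumption; the function name build_query and its parameters are hypothetical, and the field names body and created_at are taken from the question.

```python
import json


def build_query(phrases, since, field="body", time_field="created_at", size=10):
    """Build a bool query: documents newer than `since` that match
    at least one of the given phrases."""
    phrase_clauses = [{"match_phrase": {field: p}} for p in phrases]
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {time_field: {"gte": since}}},
                    # A bool with only should clauses requires at least one match.
                    {"bool": {"should": phrase_clauses}},
                ]
            }
        },
        "size": size,
        "sort": time_field,
    }


query = build_query(["a phrase", "another phrase"], "1534004694")
# Serializing from a dict guarantees balanced braces and quoted keys.
body = json.dumps(query, indent=2)
print(body)
```

The printed body can be pasted into Postman or sent with any HTTP client.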


Multikey partial index not used with elemMatch

Consider the following document format which has an array field tasks holding embedded documents
{
"foo": "bar",
"tasks": [
{
"status": "sleep",
"id": "1"
},
{
"status": "active",
"id": "2"
}
]
}
There exists a partial index on key tasks.id
{
"v": 2,
"unique": true,
"key": {
"tasks.id": 1
},
"name": "tasks.id_1",
"partialFilterExpression": {
"tasks.id": {
"$exists": true
}
},
"ns": "zardb.quxcollection"
}
The following $elemMatch query with multiple conditions on the same array element
db.quxcollection.find(
{
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
does not seem to use the index
"winningPlan": {
"stage": "COLLSCAN",
"filter": {
"tasks": {
"$elemMatch": {
"$and": [{
"id": {
"$eq": "1"
}
},
{
"status": {
"$not": {
"$eq": "active"
}
}
}
]
}
}
},
"direction": "forward"
}
How can I make the above query use the index? The index does seem to be used via dot notation
db.quxcollection.find({"tasks.id": "1"})
however I need the same array element to match multiple conditions which includes the status field, and the following does not seem to be equivalent to the above $elemMatch based query
db.quxcollection.find({
"tasks.id": "1",
"tasks.status": { "$nin": ["active"] }
})
Partial indexes are matched by path: the planner looks for the indexed path as a key in the query. With $elemMatch alone, the tasks.id path does not appear explicitly in the query. If you check with .explain("allPlansExecution"), you will see that the index is not even considered by the query planner.
To benefit from the index you can specify the path in the query:
db.quxcollection.find(
{
"tasks.id": "1",
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
This duplicates part of the $elemMatch condition: the index is used to find all documents containing a task with the given id, and the $elemMatch filter is then applied at the fetch stage to discard documents whose matching task is "active". I must admit the query doesn't look nice, so maybe add a comment to the code explaining why it is written this way.
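To see why the plain dot-notation query from the question is not equivalent to the $elemMatch query, here is a small Python sketch that emulates the two matching rules on the sample document. It is an illustration of the semantics only, not MongoDB code; both function names are hypothetical.

```python
def matches_dot_notation(doc, task_id, excluded_status):
    """Emulates {"tasks.id": id, "tasks.status": {"$nin": [status]}}.
    Each condition is checked against the array as a whole."""
    tasks = doc.get("tasks", [])
    any_id = any(t.get("id") == task_id for t in tasks)
    # $nin on an array path holds only if NO element has the excluded value.
    none_excluded = all(t.get("status") != excluded_status for t in tasks)
    return any_id and none_excluded


def matches_elem_match(doc, task_id, excluded_status):
    """Emulates $elemMatch: a single element must satisfy both conditions."""
    return any(
        t.get("id") == task_id and t.get("status") != excluded_status
        for t in doc.get("tasks", [])
    )


doc = {
    "foo": "bar",
    "tasks": [
        {"status": "sleep", "id": "1"},
        {"status": "active", "id": "2"},
    ],
}

print(matches_elem_match(doc, "1", "active"))
print(matches_dot_notation(doc, "1", "active"))
```

With the sample document, the $elemMatch rule accepts the document because task "1" is not active, while the dot-notation rule rejects it because some other task is active.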

ElasticSearch sort array size incoherent results

I am trying to sort by array size in ElasticSearch 7.1.
I indexed the following data without creating any custom mapping:
{
"myarray": [{
"field": {
"value": "test"
}
}]
}
When I look at the mapping, it is giving me:
{
"properties": {
"myarray": {
"properties": {
"field": {
"properties": {
"value": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
Now I want to query the index and sort by the highest number of elements in myarray. I have tried doing:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value') ? doc['myarray.field.value'].values.size() : 0"
}
}
}
which gives me an error like Fielddata is disabled on text fields by default.[...] Alternatively use a keyword field instead. So I try with
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].values.size() : 0"
}
}
}
which gives me the error Illegal list shortcut value [values].. So then I tried with (removing the values keyword):
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].size() : 0"
}
}
}
and it works, however I have some results that are sorted nicely and suddenly an element that should be at the top appears in the middle.
Is that because it is sorting by the length of the value as a string and not the length of myarray?
This is because the text type mapping does not support sorting; to sort, you must map the array field with the keyword type.
For more information and syntax, please refer to: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-sort.html

elasticsearch aggregates some values in a single field

I have some raw data
{
{
"id":1,
"message":"intercept_log,UDP,0.0.0.0,68,255.255.255.255,67"
},
{
"id":2,
"message":"intercept_log,TCP,172.22.96.4,52085,239.255.255.250,3702,1:"
},
{
"id":3,
"message":"intercept_log,UDP,1.0.0.0,68,255.255.255.255,67"
},
{
"id":4,
"message":"intercept_log,TCP,173.22.96.4,52085,239.255.255.250,3702,1:"
}
}
Requirement
I want to group this data by one part of the message value (the protocol, e.g. TCP or UDP).
The output should look like this:
{
{
"GroupValue":"TCP",
"DocCount":"2"
},
{
"GroupValue":"UDP",
"DocCount":"2"
}
}
Attempt
I have tried the following, but it failed:
GET systemevent*/_search
{
"size": 0,
"aggs": {
"tags": {
"terms": {
"field": "message.keyword",
"include": " intercept_log[,,](.*?)[,,].*?"
}
}
},
"track_total_hits": true
}
Now I am trying to use pipelines to meet this need.
"aggs" seems to only group on whole fields.
Does anyone have a better idea?
Link
Terms aggregation
Update
My scenario is a little unusual. I collect logs from many different servers and then import them into ES, so the message fields vary widely. Using a script directly for the grouping either fails or produces inaccurate groups. I tried filtering the data by a condition first and then grouping with a script (comment code 1), but that does not produce the correct groups either.
Some more background on my scenario:
Our team uses ES to analyze server logs: rsyslog forwards the data to a central server, and Logstash then filters and extracts the data into ES. ES then holds a field called message whose value is the detailed log line, and we need to count the documents whose message contains certain values.
comment code 1
POST systemevent*/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match_phrase": {
"message": {
"query": "intercept_log"
}
}
}
]
}
},
"aggs": {
"protocol": {
"terms": {
"script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A'",
"size": 10
}
}
},
"track_total_hits": true
}
comment code 2
POST test2/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def values = /.*,.*/.matcher( doc['host.keyword'].value ); if( name.matches() ) {return values.group(1) } else { return 'N/A' }",
"size": 10
}
}
}
}
The easiest way to solve this is by leveraging scripts in the terms aggregation. The script would simply split on commas and take the second value.
POST systemevent*/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A';",
"size": 10
}
}
}
}
Use Regex
POST test2/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def m = /.*proto='(.*?)'./.matcher(doc['message.keyword'].value ); if( m.matches() ) { return m.group(1) } else { return 'N/A' }"
}
}
}
}
The results would look like
"buckets" : [
{
"key" : "TCP",
"doc_count" : 2
},
{
"key" : "UDP",
"doc_count" : 2
}
]
A better and more efficient way would be to split the message field into new fields using an ingest pipeline or Logstash.
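As a rough illustration of what such a pre-processing step would compute, here is a Python sketch that extracts the second comma-separated value from each message and counts the groups. It is not an ingest pipeline or Logstash configuration; the function name extract_protocol is hypothetical and the messages are the sample data from the question.

```python
from collections import Counter


def extract_protocol(message, default="N/A"):
    """Return the second comma-separated value, mirroring the script
    in the terms aggregation; fall back to a default if it is missing."""
    parts = message.split(",")
    return parts[1] if len(parts) > 1 else default


messages = [
    "intercept_log,UDP,0.0.0.0,68,255.255.255.255,67",
    "intercept_log,TCP,172.22.96.4,52085,239.255.255.250,3702,1:",
    "intercept_log,UDP,1.0.0.0,68,255.255.255.255,67",
    "intercept_log,TCP,173.22.96.4,52085,239.255.255.250,3702,1:",
]

# Grouping on the extracted value is what a terms aggregation on a
# dedicated protocol field would do at query time.
counts = Counter(extract_protocol(m) for m in messages)
print(counts)
```

If the protocol were stored in its own keyword field at index time, a plain terms aggregation on that field would give the same buckets without running a script per document.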

Remove elements/objects From Array in ElasticSearch Followed by Matching Query

I'm having issues trying to remove elements/objects from an array in elasticsearch.
This is the mapping for the index:
{
"example1": {
"mappings": {
"doc": {
"properties": {
"locations": {
"type": "geo_point"
},
"postDate": {
"type": "date"
},
"status": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
And this is an example document.
{
"_index": "example1",
"_type": "doc",
"_id": "8036",
"_score": 1,
"_source": {
"user": "kimchy8036",
"postDate": "2009-11-15T13:12:00",
"locations": [
[
72.79887719999999,
21.193036000000003
],
[
-1.8262150000000001,
51.178881999999994
]
]
}
}
Using the query below, I can add multiple locations.
POST /example1/_update_by_query
{
"query": {
"match": {
"_id": "3"
}
},
"script": {
"lang": "painless",
"inline": "ctx._source.locations.add(params.newsupp)",
"params": {
"newsupp": [
-74.00,
41.12121
]
}
}
}
But I'm not able to remove array objects from locations. I have tried the query below but it's not working.
POST example1/doc/3/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.locations.remove(params.tag)",
"params": {
"tag": [
-74.00,
41.12121
]
}
}
}
Kindly let me know what I am doing wrong here. I am using Elasticsearch version 5.5.2.
In painless scripts, Array.remove() method removes by index, not by value.
Here's a working example that removes array elements by value in Elasticsearch script:
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
if (ctx._source[params.array_attribute] != null) {
for (int i=ctx._source[params.array_attribute].length-1; i>=0; i--) {
if (ctx._source[params.array_attribute][i] == params.value_to_remove) {
ctx._source[params.array_attribute].remove(i);
}
}
}
""",
"params": {
"array_attribute": "<NAME_OF_ARRAY_PROPERTY_TO_REMOVE_VALUE_IN>",
"value_to_remove": "<VALUE_TO_REMOVE_FROM_ARRAY>"
}
}
}
You might want to simplify the script if it only needs to remove values from one specific array attribute. For example, removing "green" from a document's color_list array:
_doc/001 = {
"color_list": ["red", "blue", "green"]
}
Script to remove "green":
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
for (int i=ctx._source.color_list.length-1; i>=0; i--) {
if (ctx._source.color_list[i] == params.color_to_remove) {
ctx._source.color_list.remove(i);
}
}
""",
"params": {
"color_to_remove": "green"
}
}
}
Unlike add(), remove() takes the index of the element and removes it.
Your ctx._source.locations in painless is an ArrayList. It has List's remove() method:
E remove(int index)
Removes the element at the specified position in this list (optional operation). ...
See Painless API - List for other methods.
See this answer for example code.
"script" : {
"lang":"painless",
"inline":"ctx._source.locations.remove(params.tag)",
"params":{
"tag":indexToRemove
}
}
So while ctx._source.locations.add(elt) adds the element itself, ctx._source.locations.remove(indexToRemove) removes the element at the given index in the array.
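The remove-by-value loop from the answers above walks the list backwards so that deleting an element does not shift the positions still to be visited. Here is a minimal Python sketch of the same idea, using the locations from the question; it is an analogy only (Python lists, not Painless), and the function name remove_value is hypothetical.

```python
def remove_value(items, value):
    """Remove every element equal to `value`, in place.
    Iterating from the end keeps the remaining indices valid after a delete."""
    for i in range(len(items) - 1, -1, -1):
        if items[i] == value:
            del items[i]  # delete by index, as List.remove(int) does
    return items


locations = [
    [72.79887719999999, 21.193036000000003],
    [-74.00, 41.12121],
    [-1.8262150000000001, 51.178881999999994],
]
remove_value(locations, [-74.00, 41.12121])
print(locations)

colors = ["red", "green", "blue", "green"]
remove_value(colors, "green")
print(colors)
```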

Permissions in ElasticSearch

Given a set of documents similar to the following:
{
"value": "Some random string here",
"permissions": ["job.view", "special.permission"]
}
We want to be able to create a search that'll allow us to pass an array of permissions to match against, for example, we might want to pass in
["job.view", "foo.bar", "pineapple.eat"]
as the permissions.
The document should only return in the search if all the permissions listed in the document exist in the set passed in as part of the query.
Not fussed whether we have to change the document layout, or the query, but, we're currently restricted to not being able to use the Scripting API (due to AWS).
There is a convoluted way to do this which requires you also index the number of permissions in the document, i.e. if the document contains two values in the permissions array (like in your example) then you'd also include the field permissions_count: 2.
Then your query would contain as many bool/should clauses as there are non-empty combinations of the permissions in your search array. For instance, if your search array has the 3 permissions ["job.view", "foo.bar", "pineapple.eat"], then you need to check the following conditions:
permissions contains all three permissions and permissions_count: 3
or permissions contains two of the three permissions (three combinations) and permissions_count: 2
or permissions contains only one of the three permissions (three possibilities) and permissions_count: 1
So when checking for 1 permission, you'll have 1 clause in your bool/should. For 2 permissions, you'll have 3; for 3 permissions, you'll have 7; in general, 2^n - 1 clauses for n permissions.
The full query is shown below:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"term": {
"permissions": "job.view"
}
},
{
"term": {
"permissions": "foo.bar"
}
},
{
"term": {
"permissions": "pineapple.eat"
}
},
{
"term": {
"permissions_count": 3
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "foo.bar"
}
},
{
"term": {
"permissions": "pineapple.eat"
}
},
{
"term": {
"permissions_count": 2
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "job.view"
}
},
{
"term": {
"permissions": "pineapple.eat"
}
},
{
"term": {
"permissions_count": 2
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "foo.bar"
}
},
{
"term": {
"permissions": "job.view"
}
},
{
"term": {
"permissions_count": 2
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "foo.bar"
}
},
{
"term": {
"permissions_count": 1
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "job.view"
}
},
{
"term": {
"permissions_count": 1
}
}
]
}
},
{
"bool": {
"must": [
{
"term": {
"permissions": "pineapple.eat"
}
},
{
"term": {
"permissions_count": 1
}
}
]
}
}
],
"minimum_number_should_match": 1
}
}
}
As you can see it's a bit convoluted only to check for three permissions...
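Since the number of clauses grows as 2^n - 1, the query is easier to generate than to write by hand. Below is a minimal Python sketch of such a generator, assuming the permissions and permissions_count fields described above; the function name build_permission_query is hypothetical. Note that a terms query matches documents containing any of the listed values, so this sketch emits one term clause per permission to require all of them.

```python
from itertools import combinations


def build_permission_query(allowed, field="permissions",
                           count_field="permissions_count"):
    """Build a bool/should query that matches documents whose permissions
    are all contained in `allowed`."""
    should = []
    for k in range(len(allowed), 0, -1):
        for subset in combinations(allowed, k):
            # Every permission of the subset must be present...
            must = [{"term": {field: p}} for p in subset]
            # ...and the document must hold exactly k permissions in total.
            must.append({"term": {count_field: k}})
            should.append({"bool": {"must": must}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


query = build_permission_query(["job.view", "foo.bar", "pineapple.eat"])
clauses = query["query"]["bool"]["should"]
print(len(clauses))
```

For three permissions this yields the seven clauses listed above; the clause count doubles with each additional permission, which is the main practical limit of this approach.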
I would maybe approach this problem from a different perspective and come up with another field (or set of fields) that would contain a bitset of permissions or cleverly chosen integers for each permissions, but I haven't fully thought-out this one yet.
Another solution would be to leverage Shield and document-access control instead of storing the permissions within the documents themselves.
