I'm looking at Elasticsearch for the first time and have spent around a day exploring it. We already use Lucene extensively and want to start using ES instead, so I'm considering alternative data structures to what we currently have.
If I run a *match_all* query, this is what I get at the moment. I am happy with this structure.
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 22,
    "max_score": 1,
    "hits": [
      {
        "_index": "integration-test-static",
        "_type": "sport",
        "_id": "4d38e07b-f3d3-4af2-9221-60450b18264a",
        "_score": 1,
        "_source": {
          "Descriptions": [
            {
              "FeedSource": "dde58b3b-145b-4864-9f7c-43c64c2fe815",
              "Value": "Football"
            },
            {
              "FeedSource": "e4b9ad44-00d7-4216-adf5-3a37eafc4c93",
              "Value": "Football"
            }
          ],
          "Synonyms": [
            "Football"
          ]
        }
      }
    ]
  }
}
What I can't figure out is how to write a query that pulls back this document by searching for the synonym "Football". It looks like it should be easy!
I got this approach after reading this: http://gibrown.wordpress.com/2013/01/24/elasticsearch-five-things-i-was-doing-wrong/
He mentions storing multiple fields in arrays. I realise my example does not have multiple fields, but we will certainly be looking for a solution which can cater for them.
I've tried various queries with filters, bool queries, term and terms, but none of them return the document.
What do your search and mapping look like?
If you let Elasticsearch generate the mapping, it'll use the standard analyzer, which lowercases the text (and can be configured to remove stopwords).
So Football will actually be indexed as football. The term family of queries/filters does not do text analysis, so a term query for Football will look for the exact term Football, which is not indexed. The match family of queries does analyze its input.
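You can verify what gets indexed by running the text through the Analyze API. A minimal sketch (the exact request format varies a little between Elasticsearch versions):
POST integration-test-static/_analyze
{
  "analyzer": "standard",
  "text": "Football"
}
This should return a single token, football, which is why a term query for Football finds nothing while a match query succeeds.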
This is a very common problem, and it is covered quite extensively in my article on Troubleshooting Elasticsearch searches, for Beginners, which can be worth skimming through. Text analysis is a very important part of working with search, so there are some more articles about it as well.
A simple match query would work in this scenario.
POST integration-test-static/_search
{
  "query": {
    "match": {
      "Synonyms": "Football"
    }
  }
}
Which returns:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "integration-test-static",
        "_type": "sport",
        "_id": "4d38e07b-f3d3-4af2-9221-60450b18264a",
        "_score": 0.30685282,
        "_source": {
          "Descriptions": [
            {
              "FeedSource": "dde58b3b-145b-4864-9f7c-43c64c2fe815",
              "Value": "Football"
            },
            {
              "FeedSource": "e4b9ad44-00d7-4216-adf5-3a37eafc4c93",
              "Value": "Football"
            }
          ],
          "Synonyms": [
            "Football"
          ]
        }
      }
    ]
  }
}
I'm looking to ingest this REST API data into Splunk, but I'm having issues with LINE_BREAKER and can't seem to discover the correct combination for props.conf.
Also, since the data is returned in array format without keys, do I need a script to add the keys to the returned array data, or can this be achieved within Splunk?
N.B.
The keys are returned in the tail of the response.
REST API call:
{{base_url}}accounts/{{account}}/{{siteid}}/report?dimensions=queryName,queryType,responseCode,responseCached,coloName,origin,dayOfWeek,tcp,ipVersion,querySizeBucket,responseSizeBucket&metrics=queryCount,uncachedCount,staleCount,responseTimeAvg&limit=2
Any help appreciated.
{
  "result": {
    "rows": 100,
    "data": [
      {
        "dimensions": [
          "college.edu",
          "A",
          "REFUSED",
          "uncached",
          "EWR",
          "192.0.0.0",
          "1",
          "0",
          "4",
          "48-63",
          "48-63"
        ],
        "metrics": [
          1,
          1,
          0,
          16
        ]
      },
      {
        "dimensions": [
          "school.edu",
          "A",
          "REFUSED",
          "uncached",
          "EWR",
          "192.0.0.0",
          "1",
          "0",
          "4",
          "32-47",
          "32-47"
        ],
        "metrics": [
          1,
          1,
          0,
          10
        ]
      }
    ],
    "data_lag": 0,
    "min": {},
    "max": {},
    "totals": {
      "queryCount": 12,
      "responseTimeAvg": 37.28936572607269,
      "staleCount": 0,
      "uncachedCount": 2147541
    },
    "query": {
      "dimensions": [
        "queryName",
        "queryType",
        "responseCode",
        "responseCached",
        "coloName",
        "origin",
        "dayOfWeek",
        "tcp",
        "ipVersion",
        "querySizeBucket",
        "responseSizeBucket"
      ],
      "metrics": [
        "queryCount",
        "uncachedCount",
        "staleCount",
        "responseTimeAvg"
      ],
      "since": "2022-10-17T04:37:00Z",
      "until": "2022-10-17T10:37:00Z",
      "limit": 100
    }
  },
  "success": true,
  "errors": [],
  "messages": []
}
Assuming you want the JSON object to be a single event, the LINE_BREAKER setting should be }([\r\n]+){.
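For reference, a minimal props.conf stanza along those lines might look like this (a sketch; the sourcetype name is hypothetical, and TRUNCATE/KV_MODE may need tuning for your data):
[rest_api_json]
SHOULD_LINEMERGE = false
LINE_BREAKER = }([\r\n]+){
TRUNCATE = 0
KV_MODE = json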
Splunk should have no problems parsing the JSON, but I think there will be problems relating metrics to dimensions because there are multiple sets of data and only one set of keys. Creating a script to combine them seems to be the best option.
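As a sketch of that pre-processing step, a short Python script (hypothetical, using the field names from the response above) could zip the key names from result.query onto each row and emit one keyed JSON event per line for Splunk to ingest:
import json

# hypothetical input file containing the REST API response
with open("report.json") as f:
    result = json.load(f)["result"]

dim_keys = result["query"]["dimensions"]
metric_keys = result["query"]["metrics"]

for row in result["data"]:
    # pair each value with its key from the "query" section of the response
    event = dict(zip(dim_keys, row["dimensions"]))
    event.update(zip(metric_keys, row["metrics"]))
    print(json.dumps(event))  # one keyed JSON event per line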
I need FEMMES.COM to get tokenized as singular + plural forms of the base word FEMME.
Custom Analyzer Config
"analyzers": [ { "#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer", "name": "text_language_search_custom_analyzer", "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer", "tokenFilters": [ "lowercase", "asciifolding" ], "charFilters": [ "html_strip" ] } ], "tokenizers": [ { "#odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer", "name": "text_language_search_custom_analyzer_ms_tokenizer", "maxTokenLength": 300, "isSearchTokenizer": false, "language": "english" } ], "tokenFilters": [], "charFilters": []}
Analyze API call for FEMMES
{ "analyzer": "text_language_search_custom_analyzer", "text": "FEMMES" }
Analyze API response for FEMMES
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 } ] }
Analyze API response for FEMMES.COM
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ] }
Analyze API response for FEMMES COM
{ "#odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult", "tokens": [ { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }, { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 } ]}
I think I figured this one out myself after some experimentation. I found the MappingCharFilter could be used to replace . with , before the indexer did the tokenization. This allowed the lemmatization/stemming to work as expected on the terms in question. I need to do more thorough integration tests with our other use cases, but I think this would solve the problem for anybody facing the same type of issue.
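For reference, the char filter definition might look something like this (a sketch; the filter name is made up, and it would also need to be referenced from the custom analyzer's charFilters list):
"charFilters": [
  {
    "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
    "name": "dot_to_comma_char_filter",
    "mappings": [ ".=>," ]
  }
]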
My previous answer was not correct. The Azure Search implementation actually applies the language tokenizer BEFORE token filters. This essentially made the WordDelimiterTokenFilter useless in my use case.
What I ended up having to do was to pre-process data BEFORE I uploaded to Azure for indexing. In my C# code, I added some regex logic that would break apart text like FEMMES2017 into FEMMES 2017, before I sent it to Azure. This way, when the text got to Azure, the indexer would see FEMMES by itself and properly tokenize as FEMME and FEMMES using the language tokenizer.
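The original pre-processing was C#; as an illustration of the idea, here is an equivalent Python sketch (the regex and function name are assumptions, not the author's actual code):
import re

def split_letters_from_digits(text: str) -> str:
    # insert a space at every letter/digit boundary, e.g. "FEMMES2017" -> "FEMMES 2017"
    return re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", text)

print(split_letters_from_digits("FEMMES2017"))  # FEMMES 2017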
I want to filter for all documents that contain a specific value in an array field, i.e. the value is an element of that array field.
To be specific, I want to select all documents whose names field contains test-name; see the example below.
So when I do an empty search with
curl -XGET localhost:9200/test-index/_search
the result is
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 50,
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_type": "test",
        "_id": "34873ae4-f394-42ec-b2fc-41736e053c69",
        "_score": 1,
        "_source": {
          "names": [
            "test-name"
          ],
          "age": 100,
          ...
        }
      },
      ...
    ]
  }
}
But in case of a more specific query
curl -XPOST localhost:9200/test-index/_search -d '{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "names": "test-name"
        }
      }
    }
  }
}'
I don't get any results
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
There are some questions similar to this one, although I cannot get any of the answers to work for me.
System specs: Elasticsearch 5.1.1, Ubuntu 16.04
EDIT
curl -XGET localhost:9200/test-index
...
"names": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
},
...
That's because the names field is analyzed, and test-name gets indexed as the two tokens test and name.
Searching for the exact term test-name will hence not yield anything. If you use a match query instead, you'll get the document.
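For example, this match query should return the document:
curl -XPOST localhost:9200/test-index/_search -d '{
  "query": {
    "match": {
      "names": "test-name"
    }
  }
}'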
If you want to check for the exact value test-name (i.e. the untokenized string), then you need to make your names field a keyword type instead of text, for example as sketched below.
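A sketch of such a mapping (this requires creating a new index and reindexing; the index name here is made up):
curl -XPUT localhost:9200/test-index-2 -d '{
  "mappings": {
    "test": {
      "properties": {
        "names": { "type": "keyword" }
      }
    }
  }
}'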
UPDATE
According to your mapping, the names field is analyzed, so you need to use the names.keyword sub-field instead, and it will work, like this:
curl -XPOST localhost:9200/test-index/_search -d '{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "names.keyword": "test-name"
        }
      }
    }
  }
}'
I am trying to get an object out of a JSON array that is stored in Elasticsearch. The layout is like this:
[
  object{}
  object{}
  object{}
]
What I need is, when a search hits one of these objects, to get back the specific object it matched. Currently, using the Java API, I am searching with:
QueryBuilder qb = QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("text", "pottery")
                .boost(5)
                .minimumShouldMatch("1"));

SearchResponse response = client.prepareSearch("stuff")
        .setTypes("things")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(qb)
        .setPostFilter(filter) // 'filter' is built elsewhere
        //.setHighlighterQuery(qb)
        .addField("places.numbers")
        .addField("name")
        .addField("city")
        .setFrom(0).setSize(60).setExplain(true)
        .execute()
        .actionGet();
But this will just return the whole document that was hit, or, when I tell it to return the field "places.numbers", it will only return the first object in the "places" array, not the one that was matched by the query.
Thank you for any help!
There are a couple of ways to handle this. I would probably do it with a nested type and inner hits, given what you've shown in your question, but it could also probably be done with the parent/child relationship.
Here is an example with nested docs. I set up a simple index like this:
PUT /test_index
{
  "mappings": {
    "parent_doc": {
      "properties": {
        "parent_name": {
          "type": "string"
        },
        "nested_docs": {
          "type": "nested",
          "properties": {
            "nested_name": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
Then added a couple of simple documents:
POST /test_index/parent_doc/_bulk
{"index":{"_id":1}}
{"parent_name":"p1","nested_docs":[{"nested_name":"n1"},{"nested_name":"n2"}]}
{"index":{"_id":2}}
{"parent_name":"p2","nested_docs":[{"nested_name":"n3"},{"nested_name":"n4"}]}
And now I can search like this, using "inner_hits":
POST /test_index/_search
{
  "query": {
    "nested": {
      "path": "nested_docs",
      "query": {
        "match": {
          "nested_docs.nested_name": "n3"
        }
      },
      "inner_hits": {}
    }
  }
}
which returns:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.098612,
    "hits": [
      {
        "_index": "test_index",
        "_type": "parent_doc",
        "_id": "2",
        "_score": 2.098612,
        "_source": {
          "parent_name": "p2",
          "nested_docs": [
            {
              "nested_name": "n3"
            },
            {
              "nested_name": "n4"
            }
          ]
        },
        "inner_hits": {
          "nested_docs": {
            "hits": {
              "total": 1,
              "max_score": 2.098612,
              "hits": [
                {
                  "_index": "test_index",
                  "_type": "parent_doc",
                  "_id": "2",
                  "_nested": {
                    "field": "nested_docs",
                    "offset": 0
                  },
                  "_score": 2.098612,
                  "_source": {
                    "nested_name": "n3"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
Here's the code I used to test it:
http://sense.qbox.io/gist/ef7debf436fec2a10097ba2106d5ff30ff8d7c77
I'm using Elasticsearch for this project, but a Solr solution might be appropriate too. In the query I'd like to include a should clause that will return results even if none of the other terms match. This will be used for document popularity. I'll periodically calculate reading popularity and add a float field to each doc with a numeric value.
The idea is to return docs based on terms but, when that fails, return popular docs ranked by popularity. Results should be ordered by term match score or by magnitude of the popularity score.
I realize that I could quantize the popularity and treat it like a tag ("hottest", "hotter", "hot"...), but I would prefer a numeric field since the ranking is well defined.
Here is the current form of my data (from fetch by id):
GET /index/docs/doc1
returns a sample object
{
  "_index": "index",
  "_type": "docs",
  "_id": "doc1",
  "_version": 1,
  "found": true,
  "_source": {
    "category": ["tablets", "electronics"],
    "text": ["buy", "an", "ipad"],
    "popularity": 0.95347457,
    "id": "doc1"
  }
}
Current query format
POST /index/docs/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        { "terms": { "text": ["ipad"] } }
      ],
      "must": [
        { "terms": { "category": ["electronics"] } }
      ]
    }
  }
}
This may seem an odd query format but these are structured objects, not free form text.
Can I add popularity to this query so that it returns items ranked by popularity magnitude along with those returned by the should terms? I'd boost the actual terms above the popularity so they'd be favored.
Note that I do not want to boost by popularity; I want to return popular docs when the rest of the query returns nothing.
One approach I can think of is wrapping a match_all filter in a constant_score clause and sorting on score followed by popularity. For example:
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "text": ["ipad"]
          }
        },
        {
          "constant_score": {
            "filter": {
              "match_all": {}
            },
            "boost": 0
          }
        }
      ],
      "must": [
        {
          "terms": {
            "category": ["electronics"]
          }
        }
      ],
      "minimum_should_match": 1
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "popularity": {
        "unmapped_type": "double"
      }
    }
  ]
}
You want to look into the function score query and a decay function for this.
Here's a gentle intro: https://www.found.no/foundation/function-scoring/
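As a minimal sketch of that direction, here is a function_score variant using a field_value_factor function rather than a decay function, since popularity is a plain float (the query shape is assumed from the examples above):
POST /index/docs/_search
{
  "size": 10,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            { "terms": { "category": ["electronics"] } }
          ],
          "should": [
            { "terms": { "text": ["ipad"] } }
          ]
        }
      },
      "field_value_factor": {
        "field": "popularity",
        "missing": 0
      },
      "boost_mode": "sum"
    }
  }
}
With boost_mode set to sum, documents that match no should term still get a score contribution from their popularity alone, which matches the "return popular docs when the terms fail" requirement.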