When creating an index definition in Azure Search, is there any way to add additional stop words just for that index? For example, if you are indexing street names, one would like to strip out Road, Close, Avenue, etc.
And if one makes the field non-searchable, i.e. the whole thing is indexed as one term, then what happens to something like Birken Court Road? Would the term being indexed be Birken Court?
Many thanks
You can define an additional set of stopwords using a custom analyzer.
For example,
{
"name":"myindex",
"fields":[
{
"name":"id",
"type":"Edm.String",
"key":true,
"searchable":false
},
{
"name":"text",
"type":"Edm.String",
"searchable":true,
"analyzer":"my_analyzer"
}
],
"analyzers":[
{
"name":"my_analyzer",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[
"lowercase",
"english_stopwords",
"my_stopwords"
]
}
],
"tokenFilters":[
{
"name":"english_stopwords",
"#odata.type":"#Microsoft.Azure.Search.StopwordsTokenFilter",
"stopwordsList":"english"
},
{
"name":"my_stopwords",
"#odata.type":"#Microsoft.Azure.Search.StopwordsTokenFilter",
"stopwords": ["road", "avenue"]
}
]
}
In this index definition I'm setting a custom analyzer on the text field that uses the standard tokenizer, the lowercase token filter, and two stopwords token filters: one for the standard English stopwords and one for the additional set of stopwords. You can test the behavior of your custom analyzer with the Analyze API, for example:
request:
{
"text":"going up the road",
"analyzer": "my_analyzer"
}
response:
{
"tokens": [
{
"token": "going",
"startOffset": 0,
"endOffset": 5,
"position": 0
},
{
"token": "up",
"startOffset": 6,
"endOffset": 8,
"position": 1
}
]
}
Analyzers are not applied to non-searchable fields, therefore the stopword in your example would not be removed. To learn more about query and document processing see: How full text search works in Azure Search.
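For intuition, the analyzer chain above can be mimicked in a few lines of plain Python. This is a rough sketch, not the service's implementation, and the English stopword list here is an abbreviated sample:

```python
# Plain-Python illustration of the custom analyzer chain:
# tokenize, lowercase, then drop both stopword sets.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of"}  # abbreviated sample list
CUSTOM_STOPWORDS = {"road", "avenue"}                # the extra set from my_stopwords

def analyze(text):
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens
            if t not in ENGLISH_STOPWORDS and t not in CUSTOM_STOPWORDS]

print(analyze("going up the road"))  # ['going', 'up']
```

The output matches the two tokens returned by the Analyze API response above.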
Related
I'm searching for some text in a field, but the problem is that whenever two documents contain all of the search tokens, the document that repeats the tokens more often gets more points instead of the shorter document.
My Elasticsearch index contains the names of foods, and I want to search for a food in it.
The document structure is like this:
{"text": "NAME OF FOOD"}
Now I have two documents like
1: {"text": "Apple Syrup Apple Apple Syrup Apple Smoczyk's"}
2: {"text": "Apple Apple"}
If I search using this query
{
"query": {
"match": {
"text": {
"query": "Apple"
}
}
}
}
The first document comes first because it contains more occurrences of Apple,
which is not my expected result. It would be better if the second document got more points, because it contains Apple and its length is shorter than the first one's.
Elasticsearch scoring gives weight to term frequency and field length. In general, shorter fields are scored higher, but term frequency can offset that.
You can use the unique token filter to generate unique tokens for the text. This way, multiple occurrences of the same token will not affect the scoring.
Mapping
{
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"unique", "lowercase"
]
}
}
}
}
}
Analyze
GET index29/_analyze
{
"text": "Apple Apple",
"analyzer": "my_analyzer"
}
Result
{
"tokens" : [
{
"token" : "apple",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
Only a single token is generated even though Apple appears twice.
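The effect of the analyzer chain can be sketched in plain Python (an approximation for intuition, not the Lucene implementation; it applies the filters in the same order as the mapping above, unique then lowercase):

```python
def my_analyzer(text):
    # Standard-ish tokenizer: split on whitespace.
    tokens = text.split()
    # "unique" filter: keep only the first occurrence of each token.
    seen, unique_tokens = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            unique_tokens.append(t)
    # "lowercase" filter.
    return [t.lower() for t in unique_tokens]

print(my_analyzer("Apple Apple"))              # ['apple']
print(my_analyzer("Apple Syrup Apple Syrup"))  # ['apple', 'syrup']
```

With duplicates removed at index time, both example documents contribute the single term apple, so term frequency no longer rewards the longer one.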
I have a quite big number of records currently stored in mongodb, each looks somehow like this:
{
"_id" : ObjectId("5c38d267b87d0a05d8cd4dc2"),
"tech" : "NodeJs",
"packagename" : "package-name",
"packageversion" : "0.0.1",
"total_loc" : 474,
"total_files" : 7,
"tecloc" : {
"JavaScript" : 316,
"Markdown" : 116,
"JSON" : 42
}
}
What I want to do is find similar records, e.g. records which have about the same (+/-10%) total_loc or use some of the same technologies (tecloc).
Can I somehow do this with a query against MongoDB, or is there a technology that fits better for what I want to do? I am fine with regenerating the data and storing it in, e.g., Elasticsearch or some graph DB.
Thank you
One possibility for solving this problem is to use Elasticsearch. I'm not claiming that it's the only solution you have.
At a high level, you would need to set up Elasticsearch and index your data. There are various ways to achieve this: mongo-connector, Logstash with the JDBC input plugin, or even just dumping the data from MongoDB and loading it manually. There are no limits on how to do this job.
The change I would propose initially is to make tecloc a multi-valued field, by replacing { with [ and adding explicit fields for the language name and lines of code, e.g.:
{
"tech": "NodeJs",
"packagename": "package-name",
"packageversion": "0.0.1",
"total_loc": 474,
"total_files": 7,
"tecloc": [
{
"name": "JavaScript",
"loc": 316
},
{
"name": "Markdown",
"loc": 116
},
{
"name": "JSON",
"loc": 42
}
]
}
This data model is very trivial and obviously has some limitations, but it's already something for you to start with and see how well it fits your other use cases. Later, you should look into the nested type as one possibility to model your data more accurately.
Regarding your exact search scenario - you could search those kind of documents with a query like this:
{
"query": {
"bool": {
"should": [
{
"term": {
"tecloc.name.keyword": {
"value": "Java"
}
}
},
{
"term": {
"tecloc.name.keyword": {
"value": "Markdown"
}
}
}
],
"must": [
{"range": {
"total_loc": {
"gte": 426,
"lte": 521
}
}}
]
}
}
}
Unfortunately, there is no query syntax for +/-10%, so those bounds are something that should be calculated on the client.
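That client-side calculation might look like this (a plain-Python sketch; the helper name is made up):

```python
def loc_range(total_loc, pct=0.10):
    # Compute gte/lte bounds for a +/-10% range clause on the client side,
    # truncating to whole lines of code.
    return {"gte": int(total_loc * (1 - pct)),
            "lte": int(total_loc * (1 + pct))}

print(loc_range(474))  # {'gte': 426, 'lte': 521}
```

For total_loc of 474 this produces the 426 and 521 bounds used in the range clause above.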
On the other hand, I specified that we are searching for documents which should have Java or Markdown, which returns the example document as well. In this case, if I had a document with both Java and Markdown, the score of that document would be higher.
I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed is all default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued/array of strings field. The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again there are only two items and the one listed should match the query because the profile_topics field matches with the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays and are to have no norms and no analyzer so profile_topics are not stemmed or tokenized since all values should be treated as tokens (even the spaces). Not sure this would solve the problem but it is how I treat similar data on Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue; if so, how do I solve this so that even with two items the query will return ega? If possible, I'd like to solve this index- or type-wide rather than per field.
Basketball (with a capital B) in a terms query will not be analyzed. This means it is the exact term that will be looked up in the Elasticsearch index.
You say you have the defaults. If so, indexing Basketball under the profile_topics field means that the actual term in the index will be basketball (with a lowercase b), which is the result of the standard analyzer. So either set profile_topics to not_analyzed, or search for basketball and not Basketball.
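The mismatch can be illustrated in plain Python (a rough approximation of the standard analyzer for intuition, not the real implementation):

```python
import re

def standard_like_analyze(text):
    # Rough approximation of the standard analyzer:
    # split on non-alphanumerics and lowercase each token.
    return [t.lower() for t in re.findall(r"\w+", text)]

indexed_terms = standard_like_analyze("Basketball")  # what ends up in the index
print(indexed_terms)                  # ['basketball']

# A terms query is not analyzed, so it looks up the literal term:
print("Basketball" in indexed_terms)  # False -> no match
print("basketball" in indexed_terms)  # True  -> match
```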
Read this about terms.
Regarding setting all the fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field, where only this subfield is not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you want to use the analyzed field in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}
Here I have quoted my code for filtering on multiple search criteria. I could not find the mistake in it. Please suggest the right code to make it work.
Employee document:
{
"_id": "527c8d9327c6f27f17df0d2e17000530",
"_rev": "24-276a8dc913559901897fd601d2f9654f",
"proj_role": "TeamMember",
"work_total_experience": "3",
"personal": {
"languages_known": [
"English","Telugu"
]},
"skills": [
{
"skill_set": "Webservices Framework",
"skill_exp": 1,
"skill_certified": "yes",
"skill_rating": 3,
},
{
"skill_set": "Microsoft",
"skill_exp": 1,
"skill_certified": "yes",
"skill_rating": 3,
}
],
"framework_competency": "Nasscom",
"type": "employee-docs"
}
Design Document:
{
"_id": "_design/sample",
"_rev": "86-1250f792e6e84f6f33447a00cf64d61d",
"views": {},
"language": "javascript",
"indexes": {
"search": {
"index": "function(doc){\n index(\"default\", doc._id);if(doc.type=='employee-docs'){\nif (doc.proj_role){index(\"project_role\", doc.proj_role);}if(doc.work_total_experience){\nindex(\"work_experience\", doc.work_total_experience);}\nif(doc.personal.languages_known){for(c in doc.personal.languages_known){ \n index(\"languages_known\",doc.personal.languages_known[c]);}} if(doc.skills){for (var i=0;i<doc.skills.length;i++){\nindex('skill_set',doc.skills[i].skill_set);}}}}"
}
}
}
Run using the URL below: https://ideyeah4.cloudant.com/opteamize_new/_design/sample/_search/search?q=project_role:TeamMember%20AND%20work_experience:%223%22%20AND%20languages_known:Telugu%20AND%20skill_set:Microsoft&include_docs=true
A simple way to debug this is to query the top 100 results in your index:
https://ideyeah4.cloudant.com/opteamize_new/_design/sample/_search/search?q=*:*&limit=100
This will at least tell you whether there are any documents in your index at all.
Your current query (without URL encoding) looks like:
project_role:TeamMember AND work_experience:"3" AND languages_known:Telugu AND skill_set:Microsoft
I'd suggest that some of these search values require quotes - this is always true when you are searching on string values. Next, you could try:
project_role:"TeamMember"
see if you get any results and refine from there.
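Once the individual clauses work, building and URL-encoding a fully quoted query string could look like this (a plain-Python sketch; field names are taken from the index above):

```python
from urllib.parse import quote

criteria = {
    "project_role": "TeamMember",
    "work_experience": "3",
    "languages_known": "Telugu",
    "skill_set": "Microsoft",
}
# Quote every value and AND the clauses together.
query = " AND ".join(f'{field}:"{value}"' for field, value in criteria.items())
print(query)
# project_role:"TeamMember" AND work_experience:"3" AND languages_known:"Telugu" AND skill_set:"Microsoft"
print(quote(query))  # percent-encoded form for the ?q= parameter
```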
Debugging this might also be easier if you store the values as well as index them (so you can see exactly what is indexed). To do this, add an options object { "store": true } to each index call. For example,
index("languages_known", doc.personal.languages_known[c], { "store": true });
Now, when you query the index it will return a list of fields which were stored with each match.
I'm currently trying to do something fancy in elasticsearch...and it ALMOST works.
Use case: I have to limit the number of results per a certain field to (x) results.
Example: In a result set of restaurants I only want to return two locations per restaurant name. If I search Mexican Food, then I should get (x) Taco Bell hits, (x) Del Taco Hits and (x) El Torito Hits.
The Problem: My aggregation is currently only matching partials of the term.
For instance: if I try to match company_name, it will create one bucket for taco and another bucket for bell, so Taco Bell might show up in 2 buckets, resulting in (x) * 2 results for that company.
I find it hard to believe that this is the desired behavior. Is there a way to aggregate by the entire search term?
Here's my current aggregation JSON:
"aggs": {
"by_company": {
"terms": {
"field": "company_name"
},
"aggs": {
"first_hit": {
"top_hits": {"size":1, "from": 0}
}
}
}
}
Your help, as always, is greatly appreciated!
Yes. If your company_name is just a regular string with the standard analyzer, or whatever analyzer you are using for company_name splits the name into tokens, then this is your answer. ES stores terms, not words or entire texts, unless you tell it otherwise.
Assuming your current analyzer for that field does just what I described above, you need another field - let's call it "raw" - that mirrors your company_name field but stores the company name as is.
This is what I mean:
{
"mappings": {
"test": {
"properties": {
...,
"company_name": {
"type": "multi_field",
"fields": {
"company_name": {
"type": "string" #and whatever you currently have in your mapping for `company_name`
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
And in your query, you'll do it like this:
"aggs": {
"by_company": {
"terms": {
"field": "company_name.raw"
},
"aggs": {
"first_hit": {
"top_hits": {"size":1, "from": 0}
}
}
}
}
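To see why the raw subfield fixes the bucketing, here is a plain-Python simulation of the aggregation keys (for intuition only, not Elasticsearch code):

```python
from collections import Counter

docs = ["Taco Bell", "Taco Bell", "Del Taco"]

# Analyzed field: every lowercased token becomes its own bucket key,
# so one company name can be spread across several buckets.
analyzed_buckets = Counter(t.lower() for d in docs for t in d.split())
print(dict(analyzed_buckets))  # {'taco': 3, 'bell': 2, 'del': 1}

# not_analyzed "raw" field: the whole company name is one bucket key.
raw_buckets = Counter(docs)
print(dict(raw_buckets))       # {'Taco Bell': 2, 'Del Taco': 1}
```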