Choosing right analyzers for Azure Search


We have created an index in Azure Search Service as below:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "SWMLuceneAlongWithCustomHyphenAnalyser",
"tokenizer": "keyword_v2",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
]
This analyzer is assigned to a property called "lowerMachineTag". When we search using the query below, we get the expected result:
Query: search=lowerSystemID:/.*it\'s.*/lowerMachineTag:/.*it\'s.*/&$filter=(systemID%20ne%20null)%20and%20(ownerSalesforceRecordID%20eq%20'a0h5B000000gJKfQAM')&$count=true&$top=100&$skip=0
Results:
{
"@odata.context": "https://abcd/indexes('orders-index')/$metadata#docs",
"@odata.count": 4,
"value": [
{
"@search.score": 0.1862714,
"systemID": "*1QXEDL8E2V8MGBY",
"machineTag": "It's me",
"systemIDMachineTag": "*1QXEDL8E2V8MGBY|It's me",
"machineTagSystemID": "It's me|*1QXEDL8E2V8MGBY",
"lowerMachineTag": "it's me",
"lowerSystemID": "*1qxedl8e2v8mgby",
"ownerSalesforceRecordID": "a0h5B000000gJKfQAM",
"parentSalesforceRecordID": "a0h5B000000gJKfQAM"
},
{
"@search.score": 0.16417237,
"systemID": "*1QXEDL8E2V8MGBY",
"machineTag": "It's me",
"systemIDMachineTag": "*1QXEDL8E2V8MGBY|It's me",
"machineTagSystemID": "It's me|*1QXEDL8E2V8MGBY",
"lowerMachineTag": "it's me",
"lowerSystemID": "*1qxedl8e2v8mgby",
"ownerSalesforceRecordID": "a0h5B000000gJKfQAM",
"parentSalesforceRecordID": "a0h5B000000gJKfQAM"
},
{
"@search.score": 0.16417237,
"systemID": "*1QXEDL8E2V8MGBY",
"machineTag": "It's me",
"systemIDMachineTag": "*1QXEDL8E2V8MGBY|It's me",
"machineTagSystemID": "It's me|*1QXEDL8E2V8MGBY",
"lowerMachineTag": "it's me",
"lowerSystemID": "*1qxedl8e2v8mgby",
"ownerSalesforceRecordID": "a0h5B000000gJKfQAM",
"parentSalesforceRecordID": "a0h5B000000gJKfQAM"
},
{
"@search.score": 0.16417237,
"systemID": "*1QXEDL8E2V8MGBY",
"machineTag": "It's me",
"systemIDMachineTag": "*1QXEDL8E2V8MGBY|It's me",
"machineTagSystemID": "It's me|*1QXEDL8E2V8MGBY",
"lowerMachineTag": "it's me",
"lowerSystemID": "*1qxedl8e2v8mgby",
"ownerSalesforceRecordID": "a0h5B000000gJKfQAM",
"parentSalesforceRecordID": "a0h5B000000gJKfQAM"
}
]
}
But what would be the general recommendations for analyzer configuration if we also want results returned when we search for lowerMachineTag:/.it./, in addition to the above behavior?

It seems you are using a regular expression in the search query – for that to work you would have to also add “&queryType=full” to your query string. Otherwise, the entire search term (“lowerSystemID:/.*it\'s.*/lowerMachineTag:/.*it\'s.*/”) would be understood as a simple query, meaning it would be analyzed using the standard analyzer and matched against any searchable field. By adding “&queryType=full” your regex would be understood as such and matched only to the specified fields.
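As a rough sketch of the point above, the query string could be built like this in JavaScript. The host `example.search.windows.net` and the `api-version` value are placeholders assumed for illustration; only the search expression, the index name, and `queryType=full` come from the discussion.

```javascript
// Build the search request with the full Lucene query syntax enabled.
// The host and api-version are assumed placeholders, not taken from the question.
const params = new URLSearchParams({
  "api-version": "2020-06-30",
  search: "lowerMachineTag:/.*it\\'s.*/",
  queryType: "full", // without this, the regex is parsed as a simple query
  $count: "true",
  $top: "100",
});

const url = `https://example.search.windows.net/indexes/orders-index/docs?${params}`;
console.log(url);
```

URLSearchParams takes care of percent-encoding the regex, so the expression can be written unencoded in code.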
As for your question: if “lowerMachineTag:/.it./” is specified, it wouldn’t match any of the four documents above, because the ‘.’ at the start of the regex tries to match a character before “it”, and in all four documents the value of “lowerMachineTag” starts with “it”.
If you remove the leading ‘.’ and use only “lowerMachineTag:/it./”, it still would not match, because the regex must match the entire token (adding ‘*’ would work though: “lowerMachineTag:/it.*/”).
You can change the analyzer definition to make “/it./” work by also using the nGram_v2 token filter, like so:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "SWMLuceneAlongWithCustomHyphenAnalyser",
"tokenizer": "keyword_v2",
"tokenFilters": [
"lowercase", "myNGramTokenFilter"
],
"charFilters": []
}
],
"tokenFilters": [
{
"name": "myNGramTokenFilter",
"@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"minGram": 1,
"maxGram": 100
}
]
This would still make your original query (plus "queryType=full") return the same results, and it would also return results for "lowerMachineTag:/it./".
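To see why the n-gram filter changes the outcome, here is a small JavaScript emulation of the two analyzer setups. It is only a sketch: the helper names are made up, and JavaScript regexes stand in for Lucene regexes, which is a fair approximation for these simple patterns only.

```javascript
// Emulates keyword tokenizer + lowercase filter: the whole value is one token.
function keywordLowercase(value) {
  return [value.toLowerCase()];
}

// Emulates an nGram token filter: every substring of length minGram..maxGram.
function nGrams(token, minGram, maxGram) {
  const grams = new Set();
  for (let start = 0; start < token.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      grams.add(token.slice(start, start + len));
    }
  }
  return [...grams];
}

// A regex query has to match an entire token, so the pattern is anchored.
function matchesAnyToken(tokens, pattern) {
  const anchored = new RegExp(`^(?:${pattern})$`);
  return tokens.some((t) => anchored.test(t));
}

const plainTokens = keywordLowercase("It's me");
const gramTokens = plainTokens.flatMap((t) => nGrams(t, 1, 100));

console.log(matchesAnyToken(plainTokens, "it."));      // false: "it's me" is longer than 3 chars
console.log(matchesAnyToken(plainTokens, ".*it's.*")); // true
console.log(matchesAnyToken(gramTokens, "it."));       // true: the gram "it'" matches
console.log(matchesAnyToken(gramTokens, ".it."));      // false: nothing precedes "it"
```

The trade-off is index size: every value is expanded into all of its substrings, so this suits short fields such as tags better than long text.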
I hope this helps!

Related

How does a predefined slot return a resolution?

I'm building a simple Guess Who skill game for Alexa. I have two intents right now: GenderIntent and HairColorIntent.
GenderIntent has a custom slot to handle gender and related synonyms such as mapping "boy" and "man" to "Male". This is working great. It returns a resolution within the slot. Exactly what I need.
HairColorIntent has a predefined Amazon slot, AMAZON.Color. This is not working great as it never returns a resolution regardless of the color supplied.
Here is my model for GenderIntent and HairColorIntent:
{
"name": "GenderIntent",
"samples": [
"are you a {Gender}"
],
"slots": [
{
"name": "Gender",
"type": "GENDER_TYPES",
"samples": []
}
]
},
{
"name": "HairColorIntent",
"samples": [
"is your hair {HairColor}",
"do you have {HairColor} hair"
],
"slots": [
{
"name": "HairColor",
"type": "AMAZON.Color"
}
]
}
GenderIntent returns the following slot WITH resolutions:
{
"Gender": {
"name": "Gender",
"value": "male",
"resolutions": {
"resolutionsPerAuthority": [
{
"authority": "amzn1.er-authority.echo-sdk.amzn1.ask.skill.2ed972f4-1c5a-4cc1-8fd7-3f440f5b8968.GENDER_TYPES",
"status": {
"code": "ER_SUCCESS_MATCH"
},
"values": [
{
"value": {
"name": "Male",
"id": "63889cfb9d3cbe05d1bd2be5cc9953fd"
}
}
]
}
]
},
"confirmationStatus": "NONE",
"source": "USER"
}
}
HairColorIntent returns the following WITHOUT resolutions:
{
"HairColor": {
"name": "HairColor",
"value": "brown",
"confirmationStatus": "NONE",
"source": "USER"
}
}
I'd like HairColorIntent's HairColor slot to return the resolution. What am I doing wrong?

Resolution is only returned if you use synonyms in your slot type.
I'm not sure how you handle it in your code, but in Node.js, for example, the resolved value would be:
handlerInput.requestEnvelope.request.intent.slots.Gender.resolutions.resolutionsPerAuthority[0].values[0].value.name
If you do not use synonyms (for example for the HairColor slot), you can get the value simply by handlerInput.requestEnvelope.request.intent.slots.HairColor.value
Working with predefined slot types this should work well with your code. If you want custom slot types to also return resolution whether you actually use synonyms or not, you can always just simply give the value as a synonym and it should return the full resolution tree.
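If both kinds of slots need to be handled in one place, a small helper can prefer the resolved name and fall back to the raw value. This is a minimal sketch; `getSlotValue` is a hypothetical helper name, and the sample data below is trimmed from the two responses shown in the question.

```javascript
// Hypothetical helper: returns the canonical resolved name when entity
// resolution matched, otherwise falls back to the raw spoken value.
function getSlotValue(slots, slotName) {
  const slot = slots?.[slotName];
  if (!slot) return undefined;
  const authorities = slot.resolutions?.resolutionsPerAuthority ?? [];
  const match = authorities.find(
    (a) => a.status?.code === "ER_SUCCESS_MATCH" && a.values?.length > 0
  );
  return match ? match.values[0].value.name : slot.value;
}

// Sample slots, shortened from the request payloads in the question.
const slots = {
  Gender: {
    name: "Gender",
    value: "male",
    resolutions: {
      resolutionsPerAuthority: [
        {
          status: { code: "ER_SUCCESS_MATCH" },
          values: [{ value: { name: "Male", id: "63889cfb9d3cbe05d1bd2be5cc9953fd" } }],
        },
      ],
    },
  },
  HairColor: { name: "HairColor", value: "brown" },
};

console.log(getSlotValue(slots, "Gender"));    // Male
console.log(getSlotValue(slots, "HairColor")); // brown
```

In a handler, the `slots` argument would be handlerInput.requestEnvelope.request.intent.slots.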
Hope that answered your question.

Pact, ensure key names in array

If the returned JSON is a map, all key names specified in the response body are checked for existence. So
...
"response":
{
"status": 200,
"body":
{
"field1": "value1"
}
...
will ensure that the body contains a key "field1"; if it is missing, an error occurs.
But what if the response body is an array? I see no way to test whether all, or at least one, of the elements in this array have a specific key name. This is important: I want to be warned if key names in the backend change, because that would cause errors in my application.

You can use eachLike to specify that array elements match a particular format. The correct syntax depends on which Pact framework you're using, but with pact-js, you would say:
const { somethingLike: like, term, eachLike } = pact
....
willRespondWith: {
status: 200,
body: eachLike({
"field1": "value1"
})
}
Here is the relevant part of the documentation.
Your example suggests you're writing the Pact file yourself - if this is the case, you can use the [*] notation to describe any array element, as described in the specification:
"response":
{
"status": 200,
"body":
[
{
"field1": "value1"
}
],
...
"matchingRules": {
"$.body": {
"min": 1,
"match": "type"
},
"$.body[*].field1": {
"match": "type"
},
...
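To make the intent of these rules concrete, here is a plain JavaScript sketch of roughly what they assert about a response body. It is not Pact's implementation; `matchesArrayContract` is a hypothetical helper written only for illustration.

```javascript
// Hypothetical helper: mimics "min: 1" with "match: type" on $.body and
// on $.body[*].<key> for every key in the example element.
function matchesArrayContract(body, example, min = 1) {
  if (!Array.isArray(body) || body.length < min) return false;
  return body.every(
    (item) =>
      item !== null &&
      typeof item === "object" &&
      Object.keys(example).every(
        (key) => key in item && typeof item[key] === typeof example[key]
      )
  );
}

const example = { field1: "value1" };

console.log(matchesArrayContract([{ field1: "a" }, { field1: "b" }], example));  // true
console.log(matchesArrayContract([{ field1: "a" }, { renamed: "b" }], example)); // false
console.log(matchesArrayContract([], example));                                  // false
```

The second call is the case from the question: a renamed key in any element fails the check, while the actual values are free to differ from the example.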

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed is all default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued/array of strings field. The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again there are only two items and the one listed should match the query because the profile_topics field matches with the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays and are to have no norms and no analyzer so profile_topics are not stemmed or tokenized since all values should be treated as tokens (even the spaces). Not sure this would solve the problem but it is how I treat similar data on Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue, if so how do I solve this so that even with two items the query will return ega. If possible I'd like to solve this index or type wide rather than field specific.

Basketball (with capital B) in terms will not be analyzed. This means this is the way it will be searched in the Elasticsearch index.
You say you have the defaults. If so, indexing Basketball under profile_topics field means that the actual term in the index will be basketball (with lowercase b) which is the result of the standard analyzer. So, either you set profile_topics as not_analyzed or you search for basketball and not Basketball.
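The mismatch can be reproduced outside Elasticsearch with a few lines of JavaScript. This is only a sketch: `standardAnalyze` is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), not the real implementation.

```javascript
// Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
function standardAnalyze(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Index time: every value of the multi-valued field is analyzed into tokens.
function buildIndex(values) {
  return new Set(values.flatMap(standardAnalyze));
}

// A terms query is not analyzed: each term is looked up exactly as given.
function termsQuery(index, terms) {
  return terms.some((term) => index.has(term));
}

const index = buildIndex(["Basketball", "University of Rhode Island", "Hip Hop"]);

console.log(termsQuery(index, ["Basketball"]));                 // false
console.log(termsQuery(index, ["basketball"]));                 // true
console.log(termsQuery(index, ["University of Rhode Island"])); // false
```

The last line shows why multi-word topics need a not_analyzed field: after analysis the index only holds the individual words, never the full phrase.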
Read this about terms.
As for setting all the fields to not_analyzed, you could do that with a dynamic template. Alternatively, a template lets you do what Logstash does: define a .raw subfield for each string field and make only this subfield not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you want to use the analyzed field in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}

Multiple search filtering is not working in cloudant, why?

Here I have quoted my code for multiple search filtering. I could not find the mistake in it. Please suggest the right code to make it work.
Employee document:
{
"_id": "527c8d9327c6f27f17df0d2e17000530",
"_rev": "24-276a8dc913559901897fd601d2f9654f",
"proj_role": "TeamMember",
"work_total_experience": "3",
"personal": {
"languages_known": [
"English","Telugu"
]},
"skills": [
{
"skill_set": "Webservices Framework",
"skill_exp": 1,
"skill_certified": "yes",
"skill_rating": 3
},
{
"skill_set": "Microsoft",
"skill_exp": 1,
"skill_certified": "yes",
"skill_rating": 3
}
],
"framework_competency": "Nasscom",
"type": "employee-docs"
}
Design Document:
{
"_id": "_design/sample",
"_rev": "86-1250f792e6e84f6f33447a00cf64d61d",
"views": {},
"language": "javascript",
"indexes": {
"search": {
"index": "function(doc){\n index(\"default\", doc._id);if(doc.type=='employee-docs'){\nif (doc.proj_role){index(\"project_role\", doc.proj_role);}if(doc.work_total_experience){\nindex(\"work_experience\", doc.work_total_experience);}\nif(doc.personal.languages_known){for(c in doc.personal.languages_known){ \n index(\"languages_known\",doc.personal.languages_known[c]);}} if(doc.skills){for (var i=0;i<doc.skills.length;i++){\nindex('skill_set',doc.skills[i].skill_set);}}}}"
}
}
}
Run using the URL below: https://ideyeah4.cloudant.com/opteamize_new/_design/sample/_search/search?q=project_role:TeamMember%20AND%20work_experience:%223%22%20AND%20languages_known:Telugu%20AND%20skill_set:Microsoft&include_docs=true

A simple way to debug this is to query the top 100 results in your index:
https://ideyeah4.cloudant.com/opteamize_new/_design/sample/_search/search?q=*:*&limit=100
This will at least tell you whether there are any documents in your index at all.
Your current query (without URL encoding) looks like:
project_role:TeamMember AND work_experience:"3" AND languages_known:Telugu AND skill_set:Microsoft
I'd suggest that some of these search values require quotes - always true when you are searching string values. Next, you could try:
project_role:"TeamMember"
see if you get any results and refine from there.
Debugging this might also be easier if you store the values as well as index them (so you can see exactly what is indexed). To do this, add an object to each index call { "store": true }. For example,
index("languages_known", doc.personal.languages_known[c], { "store": true });
Now, when you query the index it will return a list of fields which were stored with each match.
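For reference, here is the index function from the design document written out readably, with { "store": true } added to each call. This is a sketch under two assumptions: the `index` stub exists only so the function can be run locally (Cloudant provides the real one), and the guards on `doc.personal` and `doc.skills` are an addition that the original one-line function does not have.

```javascript
// Local stub of Cloudant's index() so the function can be tried outside Cloudant.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options });
}

// Readable version of the search index function; "var" is kept for older JS engines.
function searchIndex(doc) {
  index("default", doc._id);
  if (doc.type !== "employee-docs") return;
  var opts = { store: true }; // stored values are returned with each match
  if (doc.proj_role) index("project_role", doc.proj_role, opts);
  if (doc.work_total_experience) index("work_experience", doc.work_total_experience, opts);
  // Added guard: documents without "personal" would otherwise throw here.
  var languages = (doc.personal && doc.personal.languages_known) || [];
  for (var i = 0; i < languages.length; i++) index("languages_known", languages[i], opts);
  var skills = doc.skills || [];
  for (var j = 0; j < skills.length; j++) index("skill_set", skills[j].skill_set, opts);
}

// Shortened copy of the employee document from the question.
searchIndex({
  _id: "527c8d9327c6f27f17df0d2e17000530",
  type: "employee-docs",
  proj_role: "TeamMember",
  work_total_experience: "3",
  personal: { languages_known: ["English", "Telugu"] },
  skills: [{ skill_set: "Webservices Framework" }, { skill_set: "Microsoft" }],
});

console.log(indexed.map((e) => `${e.name}=${e.value}`).join(", "));
```

Only the body of `searchIndex` would go into the design document (as a string); the stub and the sample call are for local experimentation.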

MongoDB Array Query Performance

I'm trying to figure out what the best schema is for a dating-site-like app. Users have a listing (possibly many) and they can view other users' listings to 'like' and 'dislike' them.
Currently I'm just storing the other person's listing id in a likedBy and dislikedBy array. When a user 'likes' a listing, it puts their listing id into the 'liked' listing's arrays. However, I would now like to track the timestamp at which a user likes a listing. This would be used for a user's 'history list' or for data analysis.
I would need to do two separate queries:
find all active listings that this user has not liked or disliked before
and for a user's history of 'liked'/'disliked' choices
find all the listings user X has liked in chronological order
My current schema is:
listings
_id: 'sdf3f'
likedBy: ['12ac', 'as3vd', 'sadf3']
dislikedBy: ['asdf', 'sdsdf', 'asdfas']
active: bool
Could I do something like this?
listings
_id: 'sdf3f'
likedBy: [{'12ac', date: Date}, {'ds3d', date: Date}]
dislikedBy: [{'s12ac', date: Date}, {'6fs3d', date: Date}]
active: bool
I was also thinking of making a new collection for choices.
choices
Id
userId // id of current user making the choice
userlistId // listing of the user making the choice
listingChoseId // the listing they chose yes/no
type
date
I'm not sure of the performance implications of having these choices in another collection when doing the find all active listings that this user has not liked or disliked before.
Any insight would be greatly appreciated!

Well you obviously thought it was a good idea to have these embedded in the "listings" documents so your additional usage patterns to the cases presented here worked properly. With that in mind there is no reason to throw that away.
To clarify though, the structure you seem to want is something like this:
{
"_id": "sdf3f",
"likedBy": [
{ "userId": "12ac", "date": ISODate("2014-04-09T07:30:47.091Z") },
{ "userId": "as3vd", "date": ISODate("2014-04-09T07:30:47.091Z") },
{ "userId": "sadf3", "date": ISODate("2014-04-09T07:30:47.091Z") }
],
"dislikedBy": [
{ "userId": "asdf", "date": ISODate("2014-04-09T07:30:47.091Z") },
{ "userId": "sdsdf", "date": ISODate("2014-04-09T07:30:47.091Z") },
{ "userId": "asdfas", "date": ISODate("2014-04-09T07:30:47.091Z") }
],
"active": true
}
Which is all well and fine except that there is one catch. Because you have this content in two array fields you would not be able to create an index over both of those fields. That is a restriction where only one array type of field (or multikey) can be included within a compound index.
So to solve the obvious problem with your first query not being able to use an index, you would structure like this instead:
{
"_id": "sdf3f",
"votes": [
{
"userId": "12ac",
"type": "like",
"date": ISODate("2014-04-09T07:30:47.091Z")
},
{
"userId": "as3vd",
"type": "like",
"date": ISODate("2014-04-09T07:30:47.091Z")
},
{
"userId": "sadf3",
"type": "like",
"date": ISODate("2014-04-09T07:30:47.091Z")
},
{
"userId": "asdf",
"type": "dislike",
"date": ISODate("2014-04-09T07:30:47.091Z")
},
{
"userId": "sdsdf",
"type": "dislike",
"date": ISODate("2014-04-09T07:30:47.091Z")
},
{
"userId": "asdfas",
"type": "dislike",
"date": ISODate("2014-04-09T07:30:47.091Z")
}
],
"active": true
}
This allows an index that covers this form:
db.post.ensureIndex({
"active": 1,
"votes.userId": 1,
"votes.date": 1,
"votes.type": 1
})
Actually you will probably want a few indexes to suit your usage patterns, but the point is that you now can have indexes you can use.
Covering the first case you have this form of query:
db.post.find({ "active": true, "votes.userId": { "$ne": "12ac" } })
That makes sense considering that you clearly are not going to have both a like and a dislike option for each user. Given the order of that index, at least active can be used to filter, because your negating condition needs to scan everything else. There is no way around that with any structure.
For the other case you probably want the userId to be in an index before the date and as the first element. Then your query is quite simple:
db.post.find({ "votes.userId": "12ac" })
.sort({ "votes.userId": 1, "votes.date": 1 })
You may be wondering whether you have lost something, since getting the count of "likes" and "dislikes" was previously as easy as testing the size of the array, and now it's a little different. That is not a problem; it can be solved using aggregate:
db.post.aggregate([
{ "$unwind": "$votes" },
{ "$group": {
"_id": {
"_id": "$_id",
"active": "$active"
},
"likes": { "$sum": { "$cond": [
{ "$eq": [ "$votes.type", "like" ] },
1,
0
]}},
"dislikes": { "$sum": { "$cond": [
{ "$eq": [ "$votes.type", "dislike" ] },
1,
0
]}}
}}
])
So whatever your actual usage form you can store any important parts of the document to keep in the grouping _id and then evaluate the count of "likes" and "dislikes" in an easy manner.
You may also note that changing an entry from like to dislike can also be done in a single atomic update.
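A minimal sketch of such an update, assuming the "votes" structure above: the positional `$` operator targets the array element matched by the filter. `buildVoteChange` is a hypothetical helper used here only so the filter and update documents can be inspected without a database.

```javascript
// Hypothetical helper: builds the filter and update for flipping one user's vote.
function buildVoteChange(listingId, userId, newType, date = new Date()) {
  return {
    // Matches the listing and the array element holding this user's vote.
    filter: { _id: listingId, "votes.userId": userId },
    // "$" refers to the first "votes" element matched by the filter.
    update: { $set: { "votes.$.type": newType, "votes.$.date": date } },
  };
}

const { filter, update } = buildVoteChange("sdf3f", "12ac", "dislike");

// In the mongo shell this would be passed as: db.post.update(filter, update)
console.log(JSON.stringify(filter));
console.log(Object.keys(update.$set));
```

Because the type and the date are changed in one operation on one document, there is no window in which the vote is missing or counted twice.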
There is much more you can do, but I would prefer this structure for the reasons as given.
