Azure Cognitive Search hit highlights for phrase search operator


We are trying to use Azure Cognitive Search to enable full-text search for the documents stored in Azure Blob Storage. One of the features that we need is to show the hit highlights for a particular document.
We've noticed that while the search for an exact phrase correctly matches only those documents that contain this exact phrase, the highlights are returned for the individual words in the phrase, instead of the full phrase.
Example
For the phrase search "supply agreement" we get highlights for "supply" and "agreement".
Request:
{
"search": "\"supply agreement\"",
"select": "metadata_storage_name,metadata_storage_path,language",
"searchFields": "merged_content",
"highlight": "merged_content"
}
Response:
{
"#odata.context": "https://....search.windows.net/indexes('...')/$metadata#docs(*)",
"value": [
{
"#search.score": 0.047654618,
"#search.highlights": {
"merged_content": [
"Customer has agreed to engage Supplier to <em>supply</em> the Products and Supplier has agreed to accept the engagement on the terms set out in this <em>Agreement</em>.",
"<em>Agreement</em>\n1.",
"Tax means goods and services, value added or similar consumption based tax applicable to the <em>supply</em> of the Products under this <em>agreement</em>.",
...
]
},
"metadata_storage_name": "a2b23e30-c1e0-4c52-a659-d8705662d699.docx",
"metadata_storage_path": "...",
"language": "en"
},
...
]
}
Is this a known issue with the current version of the Azure Cognitive Search API?

Currently there is no way to highlight the whole phrase, but I have good news for you.
Phrase highlighting is work that we are tracking and plan to release, although I don't have a specific date to announce just yet.
Luis Cabrera - Principal Program Manager - Azure Cognitive Search
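Until phrase-level highlighting is available, one possible client-side workaround is to post-process the returned fragments. The sketch below is not part of the Azure Cognitive Search API: the function name rehighlight_phrase is hypothetical, and it assumes the default <em> tags and a literal, case-insensitive match of the phrase, so it ignores analyzer effects such as stemming.

```python
import re

EM_TAG = re.compile(r"</?em>")


def rehighlight_phrase(fragment: str, phrase: str) -> str:
    """Replace per-word <em> markers with markers around the whole phrase."""
    # Remove the per-word highlight tags returned by the service.
    plain = EM_TAG.sub("", fragment)
    words = [re.escape(word) for word in phrase.split()]
    if not words:
        return plain
    # Match the phrase words separated by any whitespace, ignoring case.
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: "<em>" + m.group(0) + "</em>", plain)


fragment = "applicable to the <em>supply</em> <em>agreement</em> terms"
print(rehighlight_phrase(fragment, "supply agreement"))
```

Fragments in which the words are highlighted but not adjacent end up with no markers at all, which may or may not be what you want to show.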

Related

Azure Search highlighting doesn't work for wildcards with scoring profiles

Azure Search supports hit highlighting with full-text search, which helps clients locate the matched term in a returned document. I have provided a simple index schema below to illustrate the issue.
{
"name": "simple-index",
"fields": [
{
"name": "key",
"type": "Edm.String"
},
{
"name": "simplefield",
"type": "Edm.String"
}
],
"scoringProfiles": [
{
"name": "boostedprofile",
"functionAggregation": null,
"text": {
"weights": {
"simplefield": 5,
}
},
"functions": []
}
],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": []
}
A normal search query like the one below works and returns the expected result.
search=foobar&highlight=simplefield
Extending the query to a wildcard query also works as expected: the response contains highlights on the terms matching the prefix. So far so good.
search=foo*&highlight=simplefield&querytype=full
However, when I apply a scoring profile on top of the previous query, the results are unexpected: no highlights are returned.
search=foo*&highlight=simplefield&querytype=full&scoringprofile=boostedprofile
How do I make highlights work for wildcard queries when using a scoring profile?

At the time of answering, this is a known limitation in Azure Search: highlighting doesn't work for wildcard queries when they are combined with scoring profiles. Internally, Azure Search uses a highlighter, which runs the highlighting flow as a separate process after the search itself.
For a wildcard query, the highlighter looks up all terms in the index that match the provided prefix and then uses them to compose the highlighted text. Scoring profiles affect the way terms are looked up in the index for highlighting, and as a result the response doesn't include any highlights.
As this limitation is specific to wildcard queries, one workaround is to pre-process the index so that you don't need to issue wildcard/prefix queries. Please take a look at custom analysis (https://learn.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search). You can, for example, use the edgeNGram token filter to store prefixes of words in the index and then issue a regular term query with the prefix (without the '*' operator).
I hope this is useful. Please vote on the feedback item to help us prioritize our development efforts to support other modes of highlighting that will support the above use case: https://feedback.azure.com/forums/263029-azure-search/suggestions/32661961-implement-other-highlighters
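To make the edge n-gram idea concrete, here is a small plain-Python illustration of which prefixes such a filter would store for a word. It is not the Azure analyzer itself; the function name edge_ngrams and the min_gram/max_gram parameters are illustrative assumptions that mimic the filter's gram-length settings.

```python
def edge_ngrams(term: str, min_gram: int = 2, max_gram: int = 10) -> list[str]:
    """Return the leading-edge n-grams of a term, shortest first."""
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]


indexed_tokens = edge_ngrams("foobar")
print(indexed_tokens)
# A plain term query for "foo" can now match without the '*' operator.
print("foo" in indexed_tokens)
```

Because "foo" is itself an indexed token, search=foo behaves like a regular term query. The query text presumably should not be expanded into n-grams as well, so a setup like this would likely apply the filter at index time only.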

Include fields other than count in azure facet results?

When faceting, Azure Search returns the count for each facet field by default. How do I also get other searchable fields for every facet?
For example, when I facet on area, I want something like this (description is a searchable field):
{
"area": [
{
"count": 1,
"description": "Acrylics",
"value": "ACR"
},
{
"count": 1,
"description": "Power",
"value": "POW"
}
]
}
Can someone please help with the extra parameters I need to send in the query?

Unfortunately, there is no good way to do this, as there is no direct support for nested faceting in Azure Search (you can upvote it here). To achieve the result you want, you would need to store the data together as a composite value, as described by this workaround.
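One way to read the composite-value idea is sketched below: store the code and its description in a single facetable field, then split the facet values again on the client. The "|" separator and the helper names are assumptions made for illustration, and the facet entries are assumed to carry value and count keys as in the question.

```python
SEPARATOR = "|"  # assumed delimiter; must not occur in the code or description


def to_composite(value: str, description: str) -> str:
    """Build the value to store in the facetable composite field at indexing time."""
    return f"{value}{SEPARATOR}{description}"


def split_facets(facets: list[dict]) -> list[dict]:
    """Turn facet entries on the composite field back into separate keys."""
    result = []
    for facet in facets:
        value, _, description = facet["value"].partition(SEPARATOR)
        result.append(
            {"count": facet["count"], "description": description, "value": value}
        )
    return result


# Facet entries as they might come back for the composite field.
facets = [
    {"value": to_composite("ACR", "Acrylics"), "count": 1},
    {"value": to_composite("POW", "Power"), "count": 1},
]
print({"area": split_facets(facets)})
```

The query itself only needs to facet on the composite field; the reshaping happens entirely in the client.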

Does it make sense to model a Solr document if search does not allow to specify the attributes?

I want to provide a search feature in my site where the user is able to search by text only, without specifying the attributes.
For example, instead of allowing the user to search by "author=George Martin" he will simply query "George Martin".
I would like to know if there is any advantage in a document model like this one:
{
"id": 1,
"title": "Game of Thrones",
"author": "George R. R. Martin",
"published": "August, 1996"
}
Compared to:
{
"id": 1,
"data": [
"Game of Thrones",
"George R. R. Martin",
"August, 1996"
]
}
If I'm not going to use "author:value" in the Solr API, I should get the same results, right?

The first version will allow you to assign different weights to the different fields. For example, a hit in the title might be more important than a hit in the author field, or vice versa.
Using the edismax handler (defType=edismax) and query fields (qf=title author published) will give you the same behavior as your second example, but will retain the structure of the document.
As the fields are put into the qf parameter, there is no need for the user to explicitly tell Solr which fields she wants to search.
To give the fields different weights, assign a weight to each field in the qf list: qf=title^5 author^2 published gives a hit in title five times the weight of a hit in published, i.e. "The Hunt for Red October" will be more important than something published in October.
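As a rough sketch of how a client could send such a request, the snippet below builds the query URL with the parameters discussed above. The host, the core name books, and the helper name build_edismax_url are assumptions for illustration; only defType, qf, and q come from the answer.

```python
from urllib.parse import urlencode


def build_edismax_url(base_url: str, user_text: str) -> str:
    """Build a Solr select URL that searches all weighted fields at once."""
    params = {
        "q": user_text,                       # raw user input, no field prefix
        "defType": "edismax",                 # use the extended dismax parser
        "qf": "title^5 author^2 published",   # fields to search, with weights
    }
    return base_url + "?" + urlencode(params)


# Hypothetical core named "books" on a local Solr instance.
url = build_edismax_url("http://localhost:8983/solr/books/select", "George Martin")
print(url)
```

urlencode takes care of escaping the spaces and the ^ characters, so the weights can be written exactly as in the qf example.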

Lucene/Solr: Store offset information for certain keywords

We are using Solr to store documents with keywords; each keyword is associated with a span within the document.
The keywords were produced by some fancy analytics and/or manual work prior to loading them into Solr. A keyword can be repeated multiple times in a document. On the other hand, different instances of the same string in a single document can be connected with different keywords.
For example, this document
Bill studied The Bill of Rights last summer.
could be accompanied by the following keywords (with offsets in parentheses):
William Brown (0:4)
legal term (13:31)
summer 2011 (32:43)
(Obviously in other documents, Bill could refer to Bill Clinton or Bill Gates. Similarly, last summer will refer to different years in different documents. We do have all this information for all the documents.)
I know the document can have a field, say KEYWORD, which will store William Brown. Then when I search for William Brown I will get the above document. That part is easy.
But I have no idea how to store the info that William Brown corresponds to the text span 0:4 so I can highlight the first Bill, but not the second.
I thought I could use TermVectors, but I am not sure if/how I can store custom offsets. I would think this is a fairly common scenario ...
EDIT: edited to make clear that Bill can refer to different people/things in different documents.
EDIT2: edited to make clear that a document can contain homonyms (identical strings with different meanings).

Two Q Monte
Solution Pros:
Annotations logically stored with source docs
No knowledge of highlighter implementation or custom Java highlighter development required
Since all customization happens outside of Solr, this solution should be forward-compatible to future Solr versions.
Solution Cons:
Requires two queries to be run
Requires code in your search client to merge results from one query into the other.
With Solr 4.8+ you can nest child documents (annotations) underneath each primary document (text)...
curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
{
"id": "123",
"text" : "Bill studied The Bill of Rights last summer.",
"content_type": "source",
"_childDocuments_": [
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
},
{
"id": "123-2",
"content_type": "source_annotation",
"annotation": "legal term",
"start_offset": 13,
"end_offset": 31
},
{
"id": "123-3",
"content_type": "source_annotation",
"annotation": "summer 2011",
"start_offset": 32,
"end_offset": 43
}
]
}
]'
... using block join to query the annotations.
1) Annotation Query: http://localhost:8983/solr/query?fl=id,start_offset,end_offset&q={!child of=content_type:source}annotation:"William Brown"
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123-1",
"content_type": "source_annotation",
"annotation": "William Brown",
"start_offset": 0,
"end_offset": 4
}
]
}
Store these results in your code so that you can fold in the annotation offsets after the next query returns.
2) Source Query + Highlighting: http://localhost:8983/solr/query?hl=true&hl.fl=text&fq=content_type:source&q=text:"William Brown" OR id:123
(id:123 discovered in Annotation Query gets ORed into second query)
"response":{"numFound":1,"start":0,
"docs":[
{
"id": "123",
"content_type": "source",
"text": "Bill studied The Bill of Rights last summer."
}
],
"highlighting":{}
}
Note: In this example there is no highlighting information returned because the search terms didn't match any content_type:source documents. However we have the explicit annotations and offsets from the first query!
Your client code then needs to take the content_type:source_annotation results from the first query and manually insert highlighting markers into the content_type:source results from the second query.
More block join info on Yonik's blog here.
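The client-side merge step could look roughly like the sketch below. It assumes the offsets are character positions with an exclusive end (as in the 0:4 example) and that the spans do not overlap; the function name apply_annotations and the <em> markers are illustrative choices.

```python
def apply_annotations(text: str, spans: list[tuple[int, int]],
                      pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap each (start, end) character span of text in highlight markers."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + pre + text[start:end] + post + text[end:]
    return text


source = "Bill studied The Bill of Rights last summer."
# Offsets as returned by the annotation query for "William Brown".
print(apply_annotations(source, [(0, 4)]))
```

Only the first Bill is wrapped, because the markers are driven by the stored offsets rather than by matching the string.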

By default Solr stores the start/end offsets of each token once the text is tokenized, for instance using the StandardTokenizer. This info is encoded in the underlying index. The use case that you described here sounds a lot like the SynonymFilterFactory.
When you define a synonym using the SynonymFilterFactory, stating for instance that foo is equivalent to bar, the bar term is added to the token stream generated when the text is tokenized, and it will have the same offset information as the original token. So, for instance, if your text is "foo is awesome", the term foo will have the offset information (start=0, end=3), and a new token bar (start=0, end=3) will be added to your index (assuming that you're using the SynonymFilterFactory at index time):
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
Once the SynonymFilterFactory is applied:
       bar
text:  foo  is  awesome
start: 0    4   7
end:   3    6   14
So if you fire a query using foo, the document will match, and if you use bar as your query the document will also match, since a bar token is added by the SynonymFilterFactory.
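The behaviour described above can be imitated in a few lines of plain Python. This is only an illustration of synonyms sharing the offsets of the original token, not Lucene or Solr code; the function name and the synonym dictionary are assumptions, and end offsets are exclusive.

```python
import re


def tokens_with_synonyms(
    text: str, synonyms: dict[str, list[str]]
) -> list[tuple[str, int, int]]:
    """Emit (term, start, end) tuples; synonyms reuse the original offsets."""
    stream = []
    for match in re.finditer(r"\w+", text):
        term = match.group(0).lower()
        stream.append((term, match.start(), match.end()))
        # Each synonym is stacked on the same span as its source token.
        for synonym in synonyms.get(term, []):
            stream.append((synonym, match.start(), match.end()))
    return stream


for token in tokens_with_synonyms("foo is awesome", {"foo": ["bar"]}):
    print(token)
```

Because the added term carries the span of the token it was derived from, a match on it points back at the original text.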
In your particular case, you're trying to accomplish multi-term synonyms, which is a difficult problem; you may need something more than Solr's default synonym filter. Check this post from the guys at OpenSourceConnections and this other post from Lucidworks (the company behind Solr/Lucene). These two posts should provide additional information and the caveats of each approach.
Do you need to fetch the stored offsets for some later processing?

Cloudant search documents that appear after certain id

There is a Cloudant database that stores some documents.
There is also a mobile app that retrieves those documents by using search indexes.
The question is:
Is it possible to make a query like "get me all documents that appear after this one"?
For example:
I start the app and get the documents with ids 'aaa', 'aab' and 'aac' from the database.
I want to store the last id, 'aac', in the memory of my app.
Then, the next time I start the app, I want to get the documents that appeared after 'aac' from the database.
I think the main problem will be that _ids are assigned as random strings, but I want to be sure.

When searching the index, try including the selector field in the JSON object of the request body:
{
"selector": {
"_id": {
"$gt": "the_previous_id"
}
},
"sort": [
{
"_id": "asc"
}
]
}
In addition, from https://docs.cloudant.com/document.html:
"The _id field is either created by you, or generated automatically as a UUID by Cloudant."
Therefore, it is possible to provide your own _ids when creating a document if the Cloudant-generated _ids are not working for you.
Condition operators:
https://docs.cloudant.com/cloudant_query.html#condition-operators
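A minimal sketch of how the app could build that request body from the last id it remembered is shown below. The helper name build_next_page_query and the limit value are assumptions; the selector and sort mirror the JSON above. The second part only illustrates locally that $gt on _id is a string comparison.

```python
import json


def build_next_page_query(last_id: str, limit: int = 25) -> dict:
    """Build a query body selecting documents whose _id sorts after last_id."""
    return {
        "selector": {"_id": {"$gt": last_id}},
        "sort": [{"_id": "asc"}],
        "limit": limit,  # assumed page size, adjust as needed
    }


body = build_next_page_query("aac")
print(json.dumps(body, indent=2))

# Local illustration of the same comparison: "after" means a greater string.
stored_ids = ["aaa", "aab", "aac", "aad", "aba"]
print([doc_id for doc_id in stored_ids if doc_id > "aac"])
```

Since the comparison is purely lexicographic, it returns "newer" documents only if the _ids are assigned in increasing order. Randomly generated ids give no such guarantee, which is why supplying your own sortable _ids matters here.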