All:
Right now I am using the Solr highlighting feature, but there is one thing I want to ask.
Suppose I want to search for the keywords fund and value:
fund AND value
And the return highlight part is like:
"highlighting": {
"blk_0019": {
"content": [
"philosophy of the <em>fund</em> – <em>value</em> and turning point. \n \n MUSA was an orphaned"
]
},
"blk_0006": {
"content": [
"Global Equities <em>Fund</em> Ltd. \n \n CONFIDENTIAL enclosed"
]
}
}
The problem is that I am sure blk_0019 and blk_0006 both contain fund and value (since the query is fund AND value). Because I set hl.fragsize=100, if fund and value are not located close enough together in a document, they cannot both be shown in the same snippet. In blk_0019 Solr highlights both fund and value, but in blk_0006 only fund is shown.
How can I show both matched terms in a single snippet and elide the text between them with "..." the way Google does?
Two smaller questions as well:
[1] How do I search for only the capitalized forms of a word, such as Hello or HELLO, in Solr?
[2] How do I search for the all-capital word AND? (All-capital "AND" is normally treated as the logical operator.)
Thanks
It depends on the highlighter you are using. For the Standard Highlighter you can set hl.snippets=5 for instance (default is 1). Then you'll get 5 snippets/fragments (at most), each with a maximum length of hl.fragsize.
They're returned as multiple values, so you'll need to join them yourself (using "..." for instance).
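To make the joining concrete, here is a minimal sketch in Python using only the standard library. The Solr URL, core name, and the field name "content" are assumptions taken from the question; adjust them to your setup.

```python
# Sketch: request several highlight fragments per document from Solr and
# join them Google-style with "...". The URL and field names below
# ("localhost:8983", core "mycore", field "content") are assumptions.
import json
import urllib.parse
import urllib.request

def join_fragments(fragments):
    """Join a document's highlight fragments into one snippet."""
    return " ... ".join(f.strip() for f in fragments)

def fetch_snippets(query, solr_url="http://localhost:8983/solr/mycore/select"):
    params = urllib.parse.urlencode({
        "q": query,
        "hl": "true",
        "hl.fl": "content",
        "hl.snippets": 5,     # up to 5 fragments per document (default is 1)
        "hl.fragsize": 100,
        "wt": "json",
    })
    with urllib.request.urlopen(solr_url + "?" + params) as resp:
        data = json.load(resp)
    # One joined snippet per document id
    return {doc_id: join_fragments(fields.get("content", []))
            for doc_id, fields in data["highlighting"].items()}
```

With hl.snippets=5, a document where fund and value sit far apart yields one fragment per match, and the join produces "...<em>fund</em>... ... ...<em>value</em>..." in a single displayed snippet.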
Related
I'm having an issue when using Azure Search with the following example data set: abc-123-456, abc-123-457, abc-123-458, etc.
When searching for abc-123-456, I'd expect only one result to be returned, but instead I get all results containing abc-123-...
Is there some setting or way to change this behavior?
Current search settings:
TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
    Side = EdgeNGramTokenFilterSide.Front,
    MinGram = 3,
    MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
    TokenFilters =
    {
        TokenFilterName.Lowercase,
        new TokenFilterName("frontEdgeNGram"),
        TokenFilterName.Classic,
        TokenFilterName.AsciiFolding
    }
});
SearchOptions UsersSearchOptions = new SearchOptions
{
    QueryType = SearchQueryType.Simple,
    SearchMode = SearchMode.All,
};
Using azure.search.documents ver 11.1.1
Edit: Searching with abc-123-456* (with the asterisk) gives me the one result as expected. How can I get this behavior by default?
Just to add to this..
The portal version is 2020-06-30
The sdk version we use is azure.search.documents ver 11.1.1
abc-123-456 does NOT work as expected
"abc-123-456" does NOT work as expected
"abc-123-456"* does NOT work
"abc-123-456*" does NOT work
If we append an asterisk to the end of the search text and it is not within a phrase, it works as expected.
IE:
abc-123-456* works as expected.
(abc-123-456* | abc-123-457* ) works as expected.
Why is the asterisk required? How can we make this work within a phrase?
This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text "abc-123-456" is broken into smaller tokens like "abc", "abc-1", "abc-12", "abc-123", ..., "abc-123-456". Check out the Analyze API for the full list of tokens generated by a particular analyzer.
For a query - abc-123, if the default analyzer is being used, the query terms will be abc and 123 and will match all the documents that contain these terms.
The prefix query, on the other hand, is not analyzed and looks for documents that contain the prefix as-is, "abc-123". A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result comes back. Full-text search operates over tokens in inverted indexes; everything else (filters, fuzzy, regex, prefix/wildcard, etc.) operates over verbatim strings in a separate unprocessed internal index.
Another option is to set only the search analyzer on the field to keyword, to avoid breaking up the input query.
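To see why the analyzed query matches every document, here is a minimal simulation of front edge n-gram indexing. This mimics the concept only, not Azure Cognitive Search internals; the document values are the ones from the question.

```python
# Minimal simulation of front edge n-gram indexing (MinGram=3, MaxGram=20),
# illustrating why querying "abc-123-456" matches every "abc-123-*" document
# when the same analyzer runs at index time and query time.

def edge_ngrams(token, min_gram=3, max_gram=20):
    """Front edge n-grams of a single token."""
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}

def index_terms(text):
    """Terms stored in the inverted index for a field value."""
    terms = set()
    for tok in text.lower().split():   # whitespace tokenizer + lowercase
        terms |= edge_ngrams(tok)
    return terms

docs = {"d1": "abc-123-456", "d2": "abc-123-457", "d3": "abc-123-458"}
index = {doc_id: index_terms(text) for doc_id, text in docs.items()}

# With the same analyzer on the query side, the query produces grams such as
# "abc", "abc-1", ..., "abc-123-45" that all three documents share:
query_terms = index_terms("abc-123-456")
hits = [d for d, terms in index.items() if query_terms & terms]

# With the search analyzer set to "keyword", the query stays one verbatim
# term and only matches the document whose full value produced that gram:
keyword_hits = [d for d, terms in index.items() if "abc-123-456" in terms]
```

Running this, `hits` contains all three documents while `keyword_hits` contains only d1, which matches the behavior described in the question and the fix suggested above.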
I am using Solr 7.6. When performing a search query, Solr returns a wrong _version_ field for a document, but all the other fields are correct.
In Solr dashboard the query gives the following result:
{
  "id":"518fce46-3617-4380-aaf6-8f6d36e08e6a",
  "type":"tag",
  "count":1,
  "_version_":1626999925241806848
}
Whereas, solr-node search function gives:
{
  "id": "518fce46-3617-4380-aaf6-8f6d36e08e6a",
  "type": "tag",
  "count": 1,
  "_version_": 1626999925241806800
}
Initial guess is that the solr-node module returns the value as a double (instead of as a string), and the precision of a double isn't good enough to represent the value 1626999925241806848 exactly.
We can confirm this directly in our browser's console:
-> 1626999925241806848
<- 1626999925241806800
i.e. if we input the numeric value 1626999925241806848, it'll be represented by the floating point number that's closest, and that's 1626999925241806800.
solr-node should probably return these values as strings when they exceed Number.MAX_SAFE_INTEGER, the largest range in which a double can represent every integer exactly.
Update: solr-node details this at their overview page:
Use json-bigint to correctly handle numbers too large for a JavaScript Number, such as the values of the *_l fields and _version_. By default the json-bigint library is not used, because the performance difference compared to the native JSON library is too significant with "large" chunks of JSON (https://github.com/lbdremy/solr-node-client/issues/114#issuecomment-54165595), but you want to enable it if you use the Optimistic Concurrency feature available in Solr 4.x, along with the RealTime Get and Atomic Updates features, because they use the _version_ field. In order to enable it, do var client = solr.createClient({ bigint: true }) or set it directly on the client instance: client.options.bigint = true.
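The same precision behavior can be demonstrated outside the browser, for example in Python, whose float is the same IEEE-754 double:

```python
# IEEE-754 doubles between 2**60 and 2**61 are spaced 2**8 = 256 apart,
# so distinct Solr _version_ values in that range can collapse onto the
# same double once parsed as a native number.
version_a = 1626999925241806848
version_b = 1626999925241806849  # differs only in the last digit

# Both integers round to the same nearest double:
assert float(version_a) == float(version_b)

# Keeping the value as a string (which is effectively what json-bigint
# preserves) loses no digits:
version_str = "1626999925241806848"
assert int(version_str) == version_a
```

This is why optimistic concurrency breaks without the bigint option: the _version_ you send back may no longer be the exact value Solr gave you.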
What samples can one add to a custom slot to make it accept any word or phrase?
Update
This solution has been made obsolete by the introduction of phrase slots, e.g. AMAZON.SearchQuery.
From the Announcements
Phrase slots are designed to improve speech recognition accuracy for skills where you cannot include a majority of possible values in the slot while creating the interaction model. The first slot available in this category, AMAZON.SearchQuery, is designed to provide an improved ability for you to collect generic speech from users.
The Problem
Having worked on developing an urban dictionary skill over the weekend to polish up my Alexa skills, I ran into a problem which I think a lot of skill developers might encounter.
TL;DR
Namely, how do you train Alexa on a custom slot to be able to take any value you give it?
First Attempts
At first I added about 5 words to the custom slot samples, like bae, boo, ship it. But I quickly found that the skill would only work with those 5 words, and I'd get no calls to my lambda function for words outside that list.
I then used
from nltk.corpus import words
import json, random

# random.shuffle shuffles in place and returns None,
# so shuffle after assignment rather than assigning its result
words_list = words.words()[:1000]
random.shuffle(words_list)
words_list = [word.lower() for word in words_list]
words_list = list(set(words_list))

values = []
for word in words_list:
    value = {}
    value['id'] = None
    value['name'] = {}
    value['name']['value'] = word
    value['name']['synonyms'] = []
    values.append(value)

print(json.dumps(values))
The above code uses nltk (installable with pip install nltk) to generate 1000 words according to the schema you can find under Code Editor. It produces a thousand entries like this:
{
  "id": null,
  "name": {
    "value": "amblygeusia",
    "synonyms": []
  }
}
I copied and pasted these under values, you can find the whole file under Code Editor on the Skills Builder page.
"languageModel": {
"types": [
{
"name": "phrase", //my custom slot's name
"values": [...] //pasted the thousand words generated here
...
After saving and building in the Skills Builder UI, this only allowed my skill to capture single-word slot values. I tried generating 10,000 words in the same way and adding them as samples for the custom slot, but two- and three-word phrases weren't recognised, and the skill failed to get the definition of phrases like:
ship it
The Solution
What worked for me, and worked really well, was to generate two-word samples. Despite all the examples being two words long, the skill was then able to recognise single-word values and even three-word values.
Here's the code to do that using nltk:
from nltk.corpus import words
import json, random

# random.shuffle shuffles in place and returns None,
# so shuffle after assignment rather than assigning its result
words_list = words.words()[:1000]
random.shuffle(words_list)
words_list = [word.lower() for word in words_list]
words_list = list(set(words_list))

word_pairs = []
for word in words_list:
    pair = ' '.join(random.sample(words_list, 2))
    word_pairs.append(pair)
word_pairs = list(set(word_pairs))

values = []
for pair in word_pairs:
    value = {}
    value['id'] = None
    value['name'] = {}
    value['name']['value'] = pair
    value['name']['synonyms'] = []
    values.append(value)

print(json.dumps(values))
I put that in a file called custom_slot_value_generator.py and ran it with:
python3 custom_slot_value_generator.py | xclip -selection c
This generates the values and copies them to the clipboard.
I then copied them into the Code Editor under values, replacing the old values:
"languageModel": {
"types": [
{
"name": "phrase", //my custom slot's name
"values": [...] //pasted the thousand two word pairss generated here
...
Save & Build.
That's it! Your skill would then be able to recognize any word or phrase for your custom slot, whether or not it's in the sample you generated!
Hybris 5.2
I was doing some analysis on excluding a facet value from the Solr search so that those products will not appear in the search results.
Suppose I have T-shirts in lots of colors (I don't know how many), and someone told me not to show red T-shirts in the search results.
There are two options I can think of.
Option 1: Get all the T-shirt colors available in the system, then add a filter to the Solr result.
For Example
List<String> colorList = getAllColorsExceptRed(); // Get all colors except red
for (String color : colorList) {
    searchQuery.addFacetValue("color", color);
}
This will add a filter on the color SolrIndexedProperty and will solve the problem.
But I am not keen to pick up this approach.
Option 2: Exclude the red color value from the Solr search result rather than applying a filter on all the other colors.
The Solr query would look like this:
q= *:* AND -color_string:red
//in case of multiple color exclude
q= *:* AND -color_string:(red white)
This will exclude red T-shirts from the result. But I cannot find which service or method to use to build a query like this.
Does anybody know how to achieve this query (q= *:* AND -color_string:red) with a service/method/searchQuery in Hybris?
After some trial and error, I found the solution.
SearchQuery lets us add a raw query as well, so I set the query via the addRawQuery method:
final String colors = "red white"; // the list could come from a property file as well
searchQuery.addRawQuery("-color_string:(" + colors + ")", Operator.AND);
This makes it work!!
SolrIndexedProperty has an attribute, includeInResponse; if you set it to false, the property will not be included in the result.
I wrote some text search using Google App Engine Search.
In the SDK I tested this query on an atom field:
u'tag:"wartości"'
In production I run the same query, but it does not work on the same data.
How can I do a unicode query on an atom field?
Is it possible to use unicode in Google App Engine Search?
We are aware of this issue and plan to fix it ASAP. The fix that we're currently planning will require that the atom field value include exactly the same accent characters in order to match. Matches will continue to be case-insensitive. We expect that, at least initially, values that use combining diacritical marks will be treated as different values than those using precomposed characters. We may revisit that decision depending on feedback, but it's the most straightforward fix on our end.
For more on the precomposed characters vs. combining diacritical marks, see this Wikipedia article:
http://en.wikipedia.org/wiki/Precomposed_character
Chris
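The precomposed vs. combining distinction described above can be illustrated with Python's unicodedata module, using the word from the question:

```python
# The same visible string "wartości" can be two different code-point
# sequences; an exact-match atom field treats them as different values.
import unicodedata

word_nfc = unicodedata.normalize("NFC", u"warto\u015bci")  # precomposed ś (U+015B)
word_nfd = unicodedata.normalize("NFD", u"warto\u015bci")  # s + combining acute (U+0301)

assert word_nfc != word_nfd   # different code points under the hood
assert len(word_nfc) == 8
assert len(word_nfd) == 9     # one extra combining mark

# Normalizing both inputs to the same form (e.g. NFC) before indexing and
# before querying makes them compare equal again:
assert unicodedata.normalize("NFC", word_nfd) == word_nfc
```

So a practical mitigation on the application side is to normalize both stored atom values and query strings to a single form before handing them to the search API.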
It looks like I need to translate AtomField values into new strings, and I need to translate queries too. This workaround only supports Polish unicode search. I do not know the tokenization rules, so I use 'q' and 'x' to extend the alphabet, since they are not used in Polish.
# coding=utf-8
translate = {
u'ą': u'aq',
u'Ą': u'Aq',
u'ć': u'cq',
u'Ć': u'Cq',
u'ę': u'eq',
u'Ę': u'Eq',
u'ł': u'lq',
u'Ł': u'Lq',
u'ń': u'nq',
u'Ń': u'Nq',
u'ó': u'oq',
u'Ó': u'Oq',
u'ś': u'sq',
u'Ś': u'Sq',
u'ż': u'zx',
u'Ż': u'Zx',
u'ź': u'zq',
u'Ź': u'Zq',
}
import re
reTranslate = re.compile(u'(%s)' % u'|'.join(translate))
print reTranslate.pattern
test = u"""\
Właściwie prowadzona komunikacja wewnętrzna w firmie,\
zwłaszcza dużej czy posiadającej rozproszoną sieć oddziałów,\
może przynieść oszczędność czasu, a co za tym idzie, również pieniędzy."""
print reTranslate.sub(lambda match: translate[match.group(0)], test)