Natural language processing in vespa.ai

{
"yql": "select * from sources post where text contains \"brandmüller\";",
"locale": "en"
}
The query does not yield the expected results.
If I change the query from brandmüller to Brandmüller (title case), or change the locale to de, everything works.
Admittedly, this feature is clever, because Brandmüller is the correct form. But for some reason I would prefer to simply ignore case. Is there an option to disable the uppercase/lowercase handling in the query API?

See https://docs.vespa.ai/documentation/linguistics.html - this is most likely an effect of normalization.
It is useful to add &tracelevel=5 to the query (or some other number; increase/decrease as needed) to see the effect of query processing.
Most often, leaving the default processing on (i.e. lowercasing) is what you want. It is possible to exclude searchers from the query processing chain, though (a sketch follows below) - easier to discuss once you have the processing trace.
https://docs.vespa.ai/documentation/text-matching-ranking.html#match-configuration-debug is useful, and see vespa-index-inspect / vespa-attribute-inspect in the same document for how to inspect how the terms are indexed.
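As a very rough sketch of what excluding a searcher could look like (the chain id and the excluded component here are assumptions - check the docs above for the actual components in your application), a search chain in services.xml can inherit the default chain while excluding components:

<!-- services.xml sketch: a chain inheriting the standard "vespa" chain
     but excluding one searcher; the excluded id is illustrative -->
<search>
  <chain id="nostem" inherits="vespa"
         excludes="com.yahoo.prelude.querytransform.StemmingSearcher"/>
</search>

A query could then select it with &searchChain=nostem.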

You can inspect the query handling - parsing, stemming etc. - by adding tracelevel=3.
E.g.
https://api.cord19.vespa.ai/search/?query=Brandm%C3%BCller&tracelevel=3
Stemming: [select * from sources * where default contains ([{"origin": {"original": "Brandm\u00FCller", "offset": 0, "length": 11}, "stem": false}]"brandm\u00FCller") timeout 1999;]
https://api.cord19.vespa.ai/search/?query=Brandm%C3%BCller&tracelevel=3&language=de
Stemming: [select * from sources * where default contains ([{"origin": {"original": "Brandm\u00FCller", "offset": 0, "length": 11}, "stem": false}]"brandmull") timeout 1997;]
There should be no difference based on case, but stemming depends on the language: with language=de, for example, both brandmüller and Brandmüller are stemmed to brandmull.

Matching in Vespa is case-insensitive, but stemming and normalization are not (in general).
I guess the data here is indexed with locale "de", and when you query with locale "en" you get different (wrong) stemming, but only for the lowercase version. You can verify this with tracelevel (tracelevel=1 is sufficient for that).
In general, if you have a single language, it's best to always set the locale to that language explicitly (by default, the locale is guessed, but this is unreliable for very short texts such as many queries). If you're dealing with multiple languages, things get a bit more complicated.
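For this question, that means pinning the locale to the language the data was indexed with - the same query as at the top, with only the locale changed:

{
"yql": "select * from sources post where text contains \"brandmüller\";",
"locale": "de"
}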

Related

IBM Watson Conversation fuzzy matching update causing issues with existing entities

The fuzzy matching feature of IBM Watson Conversation, since its latest update, is matching words incorrectly. E.g. "what" is getting picked up as the entity "chatbot", whereas there is no synonym in the chatbot entity that is even close to "what".
My question is: is there a way to exclude words from fuzzy matching while keeping it ON for the entity? Or is there any other solution to this problem?
Thanks
I assume you have an entity value in chatbot for 'chat bot', and it's getting a partial match on 'chat', and then doing a fuzzy match from 'chat' to 'what' because it's only one character difference and could be a spelling error.
You can turn fuzzy matching off, but you cannot currently blacklist any specific words. You can also try to protect yourself through your dialog design, so that you're only looking for #chatbot at certain points; then it shouldn't interrupt very often.
I know what you mean - we need fuzzy matching, but it sometimes creates more trouble. We have had a number of words picked up and reported as something different. The method we use to remove some of the issues is to look at the confidence value that's reported for the incorrect spelling "what", and then use that as an additional condition.
I.e. if "what" reports a confidence value of 0.6, set your condition to require 0.7: entities['chatbot']?.confidence > 0.7
Fuzzy logic can be switched on or off for each individual "class" of entities, i.e. 'chatbot' in the example above or 'city' in many of the doc examples.
I don't believe you can set one global condition that checks all entities for their confidence values, so you do need to check the confidence at the class level, as shown above.
Also, at present you cannot blacklist individual words to stop the fuzzy logic from checking them, like 'what' in your example.
Yes, you can definitely examine the confidence value. One concern I have about that is that you have no idea how many entities you will receive, so you may have to write some fairly complex logic; but if you only have one entity, it's pretty simple. When we detect entities, we return this:
"entities": [
{
"entity": "appliance",
"location": [
23,
29
],
"value": "wipers",
"confidence": 1
},
{
"entity": "appliance",
"location": [
11,
18
],
"value": "lights",
"confidence": 0.87
}
]
So to access the confidence of the first detected entity you would use entities[0].confidence > 0.x in your dialog trigger.
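For example (a sketch - the entity name and the 0.7 threshold are just illustrations), a condition can combine the entity name and its confidence:

entities[0].entity == 'appliance' && entities[0].confidence > 0.7

or, as in the earlier answer, entities['chatbot']?.confidence > 0.7 to check a specific entity class directly.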

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high-frequency (low-IDF) terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to be removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job of detecting that the above examples are nouns, not stop words - but is this the right approach to solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results - that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (stored=false) field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with different boosts for the different fields.
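As a sketch of that copyField setup (the field and type names here are invented), in schema.xml:

<!-- second, unstored copy of the content, analyzed without
     WordDelimiterFilterFactory or stopword removal -->
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="text_exact" type="text_exact_type" indexed="true" stored="false"/>
<copyField source="text" dest="text_exact"/>

<fieldType name="text_exact_type" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With edismax you could then search both fields with different boosts, e.g. qf=text^1 text_exact^2.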

No result return by Solr when query contains word that is not in the collection

I am trying to set up Solr but have encountered the problem mentioned in the title. I just downloaded Solr and used the built-in example. When I used a query with words that occur in the example documents, such as "ipod", Solr worked properly. However, when I added words that do not occur in those documents, such as "what", Solr did not return anything. This seems weird to me, since relevance scores should be computed per query term and added up; a non-existing query term should not affect the ranking (even though the coord norm is affected, so document scores will change).
Could anyone tell me what the issue might be? Thanks.
There are several ways of configuring this behavior. I'll assume that you're using the edismax query handler for these examples, although some of them also apply to the standard lucene query parser.
The reason for not always wanting "ipod what" to retrieve the same subset as "ipod" is that you'd get a poor result set and user experience for terms that are more general than "ipod" (i.e. searching for "microsoft windows" will not be perceived as a good search if you're showing only general hits about windows - it's usually better to say "we didn't find anything" in those cases). It all depends on your use case.
First, you can do it yourself, by applying either AND or OR between terms to get the exact kind of matching you're looking for.
You can use q.op to configure whether the terms should be AND-ed together (all required) or OR-ed together (any one is sufficient). This overrides the (now deprecated) value from <solrQueryParser defaultOperator=".."/> in schema.xml.
For (e)dismax, there's the mm (minimum match) parameter, which gives you more specific, yet still general, control over how matches are performed. mm allows you to say "at least 50% of the terms should match", or "if there are only two terms, both should match, but anything beyond that is optional", or "match everything up to four terms, and 75% after that" - see the examples below.
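Concretely, those three examples map to mm values roughly like this (a sketch; see the edismax documentation for the full mm syntax):

mm=50%      at least 50% of the terms must match
mm=2<2      two terms or fewer: all must match; more than two: any two suffice
mm=4<75%    four terms or fewer: all must match; beyond that: 75% must match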

How to exclude results for certain words like "West Virgina" when searching for "Virginia" in a US state list?

I've got Solr happily running, indexing a list of department names that contain US states. It is working well; however, searching for "Virginia" will turn up results containing "West Virginia", which, while certainly helpful for some business requirements, is not in ours.
Is there a special way of saying that a query for X must not contain Y (I don't mind crafting a special query for the case of "Virginia"), or can I only do this post-query, by iterating over the results and excluding those containing "West Virginia"?
Use a minus sign (hyphen) combined with the phrases/terms you want to exclude. If you use the dismax query parser, then you don't even need to specify field names.
Examples:
using dismax:
q=virginia -"west virginia"
using standard query parser:
q=field_name:(virginia -"west virginia")
Refer to the Solr Query Syntax wiki page and its further links for more examples.
You could make a state field that is a string type and just search on state:"virginia" (lowercase the string before indexing / searching)
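If you'd rather not lowercase in application code, here is a sketch of a field type that keeps the whole value as a single lowercased token (the names are invented):

<fieldType name="lowercase_string" class="solr.TextField">
  <analyzer>
    <!-- keep the whole value as one token, lowercased at index and query time -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="state" type="lowercase_string" indexed="true" stored="true"/>

state:"virginia" then matches only documents whose state value is exactly "Virginia" (in any case), not "West Virginia".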

Solr minimum match results ranking

In my Rails application I have a Question model, set up with Sunspot Solr, with a "text" field, and I'd like searches in that field to do a logical OR between words. I've found that setting minimum_match to 1 solves my problem; however, I'd also like to order the results by boosting questions that have more than one matching word. Is there a way to do this with Solr? The documentation isn't really helpful about ranking functions.
Edit: this is the full query I'm performing in the controller:
@questions = Question.solr_search do
  fulltext params[:query], :minimum_match => 1
end.results
According to http://wiki.apache.org/solr/SchemaXml:
"The default operator used by Solr's query parser (SolrQueryParser) can be configured with <solrQueryParser defaultOperator="AND|OR"/>. The default operator is "OR" if unspecified. It is preferable to not use or rely on this setting; instead the request handler or query LocalParams should specify the default operator. This setting here can be omitted and it is being considered for deprecation."
You can change your defaultOperator in solr/conf/schema.xml, or you can use LocalParams to specify OR via syntax like that described at https://github.com/sunspot/sunspot/wiki/Building-queries-by-hand - see the sketch below.
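As a sketch of the LocalParams route (using the standard lucene parser here), the operator can be set on the query itself rather than globally:

q={!lucene q.op=OR}best pizza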
It is true that Sunspot's default operator is "AND", as referenced in https://github.com/sunspot/sunspot/blob/master/sunspot_solr/solr/solr/conf/schema.xml
However, logical OR is the default behavior of the dismax request handler used by Sunspot.
Plus, the more words match, the higher the document's score - which sounds like what you want:
Question.search do
fulltext 'best pizza'
end
...should return results that match one or both words (returning the ones that match both first):
"Joe's has the best pizza by the slice in NYC"
"It's hard to say which pizza place is the best"
"Pizza isn't the best food for you"
"I don't care whether pizza is bad for you!"
"What do you think the best type of fast food is?"
minimum_match is useful only if you want to filter out low-relevance results (where only a certain low number or percentage of the terms actually matched). It doesn't affect scoring or the logical OR/AND behavior.
