Are there any Lucene stemmers that handle Shakespearean English? - solr

I'm trying to index some old documents for searching -- 16th, 17th, 18th century.
Modern stemmers don't seem to handle the antiquated word endings: worketh, liveth, walketh.
Are there stemmers that specialize in the English from the time of Shakespeare and the King James Bible? I'm currently using solr.PorterStemFilterFactory.

It looks like the rule changes needed for that are minimal.
So, it might be possible to copy and modify the PorterStemmer class and the related factories/filters.
Or it might be possible to add those specific rules as a regular-expression filter that runs before Porter.
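As a rough sketch of the regex-filter approach in schema.xml (the field type name is illustrative, and the pattern is deliberately naive -- it would also rewrite words that legitimately end in -eth, such as "teeth" or "Elizabeth", so a protected-words list would likely be needed):

```xml
<fieldType name="text_early_modern" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Rewrite archaic third-person -eth to modern -es before Porter runs:
         worketh -> workes -> work, liveth -> lives -> live -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(.+)eth$" replacement="$1es"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the rewrite happens before stemming, the archaic forms end up stemming to the same terms as their modern equivalents.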

Related

How to provide OpenNLP model for tokenization in vespa?

How do I provide an OpenNLP model for tokenization in Vespa? This mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If so, can I simply set the set_language index expression as the doc describes? I did not find any relevant information on how to implement this feature in https://docs.vespa.ai/en/linguistics.html. Could you please help me out with this?
Required for CJK support.
Yes, the default tokenizer is OpenNLP and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to use set_language (and language=...) in queries, since language detection is unreliable on short text.
However, OpenNLP tokenization (not language detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where kstem is used instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-grams instead of tokenization; see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
N-grams are often a good choice with Vespa because they don't suffer from the recall problems of CJK tokenization, and by using a ranking model that incorporates proximity (such as nativeRank) you'll still get good relevancy.
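As a sketch, gram matching is enabled per field in the schema (the field name is illustrative; a gram-size of 2 or 3 is a common choice for CJK):

```
field content type string {
    indexing: index | summary
    match {
        gram
        gram-size: 2
    }
}
```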

Avoiding keyword stuffing in SOLR

I'm looking for a way to limit the effect (or eliminate it) of "keyword stuffing" in SOLR. (We're currently running a SOLR 6.2.0 server).
I've tried setting omitTermFreqAndPositions="true", but when I do that, some queries throw phrase query errors (specifically queries with search terms such as G1966B - likely due to word splitting and such). I could go down the road of disabling the word splitting and try to avoid the phrase query errors, but this is simply going to mess up more things than I'm trying to fix.
Does anyone have any suggestions on how to limit the effect of multiple keyword matches in a single field?
Example: If we have a description field with something like this:
BrandX 1200 Series G1924B LC/MSD SL XBC System.
This BrandX 1200 Series G1924B (G 1924 B, G1924 B, G 1924B) LC/MSD SL XBC System is in excellent condition.
When someone does a search for "G1924B" I would like to avoid scoring this document higher just because it happens to have G1924B (or a variation of that) in there several times.
In theory someone could repeat the keyword many times in their description to try to trick the system into ranking their search results higher.
Any suggestions?
Thanks!
This turns out to be a more common requirement than you might initially think.
If you remove both term frequencies and positions, you lose phrase search capability.
I would recommend writing a custom similarity that ignores TF (term frequency).
At the moment the default BM25 takes TF into consideration.
You can pick that class and adjust the score calculation to treat TF as a constant.
e.g.
org.apache.lucene.search.similarities.BM25Similarity.BM25DocScorer#score
[1] org.apache.lucene.search.similarities.BM25Similarity
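To see why holding TF constant defuses keyword stuffing, here is a standalone sketch of the BM25 term-frequency component (plain arithmetic illustrating the formula, not Lucene's actual code):

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """The term-frequency saturation component of BM25:
    tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))"""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return (tf * (k1 + 1)) / (tf + norm)

# A stuffed document (term repeated 10 times) vs. a normal one (1 occurrence),
# both 100 words long in a corpus whose average document length is 100.
stuffed = bm25_tf_weight(10, 100, 100)
normal = bm25_tf_weight(1, 100, 100)
print(round(stuffed, 3), round(normal, 3))  # 1.964 1.0

# BM25 saturates, so stuffing doesn't scale linearly with repetitions,
# but the stuffed document still scores almost twice as high. Clamping
# tf to a constant (always 1) removes the advantage entirely.
```

This is why overriding the score calculation to treat TF as a constant, as suggested above, neutralizes repetition without losing positions (and thus keeps phrase queries working).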

Solr/Lucene query lemmatization with context

I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (the words before or after) to the lemmatizer.
For example, the phrase pila vodu is analyzed differently at index time than at query time. It contains the ambiguous word pila, which could mean pila (a saw, e.g. a chainsaw) or be a form of pít (the past tense of the verb "to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
.. so the word pila is not found and not highlighted in a document snippet.
This behaviour is documented at the Solr wiki (quoted below), and I can confirm it by debugging my code (only the isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, ...
So my question is:
Is it possible to somehow change, configure or adapt the query parser so the lemmatizer would see the whole query string, or at least some context of individual words? I would like to have a solution also for different solr query parsers like dismax or edismax.
I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit - trying to explain / answer the following question (thank you @femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
For sure it would be better to analyze only terms that come together. For example, at index time the lemmatizer detects sentences in the input text and analyzes together only words from a single sentence. But how can I achieve a similar thing at query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser; would I have to implement them again in my own parser?
The idea behind this is in fact a bit deeper, because the lemmatizer is doing word-sense disambiguation even for words that have the same lexical base. For example, the word bow has about 7 different senses in English (see Wikipedia) and the lemmatizer distinguishes such senses. So I would like to exploit this potential to make searches more precise -- to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: How do I get the correct <lemma;sense> pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its common context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...
Solution :
First, I analyze the whole query string together. This gives me the list of "tokens".
There is a little clash with stop words - it is not that easy to get tokens for stop words as they are omitted by the analyzer, but you can detect them from PositionIncrementAttribute.
From "tokens" I construct the query in the same way as edismax do (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for document that have words that contain a word token in search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language-specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming, which helps with things such as "run" versus "running". It won't help with your specific example of "entrepreneurial", but generally speaking this helps significantly with recall.
Split the search input before sending it to search and append "*" to every term. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search for both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*)).
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
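A minimal sketch of the second option for whitespace-separated languages (naive splitting only; the service's own tokenization may split terms differently, e.g. around punctuation):

```python
def expand_prefixes(query):
    """Turn 'aa bb' into '(aa | aa*) (bb | bb*)' so each term matches
    both exactly (where stemming can apply) and as a prefix."""
    return " ".join(f"({t} | {t}*)" for t in query.split())

print(expand_prefixes("entrepreneur matt"))
# (entrepreneur | entrepreneur*) (matt | matt*)
```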
Perhaps this page might be of interest?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by
default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

Multiword synonyms and odd token order

Why is the synonymTokenFilter putting the expanded term right after the match of the first token in a multiword synonym? While I'm using elasticsearch, this certainly would apply to any solr/lucene gurus out there as well. I am only applying this during index time, but it is in conjunction with shingles, so the order is extremely important.
I have a synonym:
popcorn popper,popcorn machine
My synonymTokenFilter has expand=true via defaults in elasticsearch.
When I view my tokens, popcorn machine is always inserted between popcorn and popper regardless whether the input term is popcorn popper or popcorn machine.
Example analyzing "popcorn popper"
t1:Popcorn t2:popcorn t3:machine t4:popper
Example analyzing "popcorn machine"
t1:Popcorn t2:popcorn t3:machine t4:popper
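The emission order looks odd, but the position increments tell the real story. A small sketch (the increments below are my assumption of what a flattened SynonymFilter emits, not captured output) shows that "machine" and "popper" share the same position, so both inputs produce identical position assignments:

```python
def assign_positions(tokens):
    """tokens: (term, position_increment) pairs in emission order.
    Each token's position is the previous position plus its increment;
    an increment of 0 stacks a token on the same position."""
    pos, out = -1, []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out

# Flattened synonym output for either "popcorn popper" or "popcorn machine":
stream = [("popcorn", 1), ("popcorn", 0), ("machine", 1), ("popper", 0)]
print(assign_positions(stream))
# [('popcorn', 0), ('popcorn', 0), ('machine', 1), ('popper', 1)]
```

This is why downstream filters like shingles must consume positions rather than emission order, otherwise they build shingles like "popcorn machine popper" that never occurred in the text.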
The Lucene token stream is actually a graph. Things like synonyms really cause problems with that graph model and token offsets. Things are improving in newer Lucene versions, however. You may need to look through the (Solr and Lucene) JIRA issues to find the relevant discussions.
