Removing additional, extra periods (dots) from tokens while indexing in solr - solr

I want to remove extra periods between tokens when Solr indexes documents.
I can always do this with custom code before sending documents to Solr. But is there a tokenizer, analyzer, or configuration option that will strip off unnecessary periods (dots)?
Example: This repair shop is very good... I would recommend it to anyone who wants to repair their bikes...Please give it a try.....
I have gone through multiple tokenizers and analyzers. None of them seems to work for this.
I am currently using solr.WhitespaceTokenizerFactory and solr.WordDelimiterFilterFactory along with a few other filters.
Because of the way I am using WordDelimiterFilterFactory, Solr is generating these tokens:
good, good..., bikes..., bikes, bikesplease, try, try.....
I don't want Solr to generate the tokens with ... at the end.
Any ideas on how to do it without writing custom code?.........

Have you tried solr.StandardTokenizerFactory?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
I tried this tokenizer and it seems to work as you expect.
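As a rough sketch of what that could look like in schema.xml, the field type below swaps the whitespace tokenizer for the standard one. The field type name text_review is hypothetical, and the charFilter is an extra assumption that is not part of the original suggestion: it replaces runs of two or more periods with a space before tokenizing, which also covers cases such as "bikes...Please".

```xml
<!-- Hypothetical field type name; adjust to your schema -->
<fieldType name="text_review" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: turn any run of two or more periods into a single space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\.{2,}" replacement=" "/>
    <!-- Splits on word boundaries and drops trailing punctuation -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI is a quick way to check which tokens the example sentence actually produces with this chain.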

Related

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with very frequent (low-IDF) terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to be removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate the POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (stored=false) field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with different boosts for different fields.
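A minimal sketch of the copyField idea follows. The names content, content_exact, and text_exact are hypothetical, and the analyzer is only one possible choice: it tokenizes on whitespace and lowercases, with no WordDelimiterFilterFactory and no stop filter, so "123-OR-A" stays a single token and the "A" in "Steve A" is kept.

```xml
<!-- Hypothetical field type: whitespace tokens, lowercased, no stopword removal -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Search-only copy of the (assumed) existing "content" field -->
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

Both fields can then be searched together, for example with the edismax parser and a parameter along the lines of qf=content content_exact^2, where the boost value is only a placeholder to tune.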

Apache solr wild card searching with multiple words

We are using Apache Solr with PHP.
There is a problem with wildcard searching.
We want to search for "project manage*", which should list results like project manager, project management, etc. However, it does not work whenever the wildcard search contains two words.
For example, "projectmanage*" works whereas "project manage*" does not. We also tried escaping the space, but that does not work either.
Looking forward to all valuable inputs. Thanks in advance.
When applying a wildcard, the regular analysis chain is not performed when querying. This results in Solr looking for tokens starting with "project manage" - and if you have an analysis chain when indexing, your text is usually split into multiple tokens, so no single token starts with that prefix.
You can use a Shingle filter to index multiple tokens as a single token, which can be used to get around the issue (be sure to use the same separator as you use in your text).
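A sketch of the shingle approach, under some assumptions: the field type name text_shingle is hypothetical, shingles are limited to two words, and the query side keeps the input as one lowercased token so that an escaped query such as project\ manage* is compared against the indexed two-word shingles.

```xml
<!-- Hypothetical field type name -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index single words plus two-word shingles joined by a space,
         e.g. "project", "manager", "project manager" -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=" "/>
  </analyzer>
  <analyzer type="query">
    <!-- Keep the query text as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Shingles increase index size, so it may be worth restricting this field type to short fields such as titles.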
Another option is to lowercase the field when indexing and querying and use a regular StrField which isn't processed in any way, or use a KeywordTokenizer - which keeps the indexed content as a single token.
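The KeywordTokenizer variant could be sketched as below. The names string_lower and title_exact are hypothetical. Because the whole field value is indexed as one token, a prefix query only matches values that begin with "project manage".

```xml
<!-- Hypothetical field type: the whole value is one lowercased token -->
<fieldType name="string_lower" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_exact" type="string_lower" indexed="true" stored="false"/>
```

A query against this field would escape the space so the parser treats the input as one term, for example title_exact:project\ manage*.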

Using solr 4.2 how do I use/enable fuzzy phrase searching

So right now I'm just using the admin interface to run search queries. I know that a tilde ~ suffix turns a word into a fuzzy search term.
However, what about a phrase? I tried "some words"~ but it doesn't seem to return results when it should. Any idea why? Do I need a special fieldtype or special filters?
Right now, everything is pretty vanilla but I did import a lot of data. (About 12 million rows). I know that there are things in there that should be getting returned with a good fuzzy match that are not.
Any help is appreciated.
Also, if it makes a difference, I would like to use the Levenshtein algorithm.
ComplexPhraseQueryParser can be used to handle wildcard and fuzzy phrase queries.
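As a sketch only: the handler below makes the complex phrase parser the default for one request handler in solrconfig.xml. The handler name /fuzzyphrase and the default field text are hypothetical. It also assumes a Solr release that ships the parser under the name complexphrase, which to my knowledge means 4.8 or later, so Solr 4.2 would need an upgrade or a custom plugin.

```xml
<!-- Hypothetical handler name and default field -->
<requestHandler name="/fuzzyphrase" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Assumption: the complexphrase parser is available in this Solr version -->
    <str name="defType">complexphrase</str>
    <str name="df">text</str>
  </lst>
</requestHandler>
```

With this parser the tilde goes on the individual terms inside the quotes, as in q="some~ words~". In the standard Lucene syntax, a tilde placed after the closing quote is read as phrase slop (proximity) rather than fuzziness, which may explain the missing results.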

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?
Did you have a look at the Solr spellcheck component?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such case.
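A minimal solrconfig.xml sketch for that spellchecker might look like the following. The dictionary name wordbreak and the field name text are assumptions, and the component still has to be attached to your search handler (for example through last-components).

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <!-- Hypothetical field name; use the field your documents are searched on -->
    <str name="field">text</str>
    <!-- Suggest "foolproof" for "fool proof" -->
    <str name="combineWords">true</str>
    <!-- Suggest "fool proof" for "foolproof" -->
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
```

Requests would then include parameters such as spellcheck=true and spellcheck.dictionary=wordbreak. How these are passed through Sunspot depends on the gem version, so treat that part as something to verify.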

When stemming is enabled, searching for the root word gives no hits

I have indexed a site with Solr. It works very well if stemming is not enabled. With stemming, however, Solr does not return any hits when searching for the root of a word. I use Swedish stemming.
For example, searching for support gives hits if stemming is not used. With stemming, searching for support gives no hits. However, searching for supporten returns hits that match support.
By debugging the query, I can see that it stems the word support to suppor (which is incorrect, by the way, but that should not matter). However, with the word stemmed to suppor, I want it to search for matches with the original query word as well.
I'd appreciate any help on this!
Afaik, there is no way to keep the original word when stemming...
I assume that you are using solr.SnowballPorterFilterFactory. The Snowball algorithm can be too aggressive.
You should try a Hunspell stemmer or maybe solr.SwedishLightStemFilterFactory.
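For illustration, a field type using the light Swedish stemmer could be sketched as follows. The name text_sv is hypothetical and the rest of the chain is a plain assumption. The same analyzer is used at index and query time, so both sides stem identically, and the documents need to be reindexed after the change.

```xml
<!-- Hypothetical field type name -->
<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Lighter alternative to the Snowball stemmer for Swedish -->
    <filter class="solr.SwedishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```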
A workaround is to rewrite your query as "support support*" or "support support~". In Lucene syntax, * is wildcard matching and ~ is fuzzy matching. I know you didn't mention a need for wildcard or fuzzy search, but I found that stemming is not applied to wildcard and fuzzy terms at query time, so "support" is preserved in the second term. Stemming still applies to the first term, so results for both forms are returned, if any. As an added benefit, fuzzy search helps tolerate typos in users' queries.
