Apache Solr wildcard searching with multiple words

We are using Apache Solr with PHP and have a problem with wildcard searching.
We want to search for "project manage*", which should list results like "project manager", "project management", etc. However, wildcard searching does not work whenever the query contains two words.
For example, "projectmanage*" works, whereas "project manage*" does not. We also tried escaping the space, but that does not work either.
Looking forward to all valuable inputs. Thanks in advance.

When you use a wildcard, the regular analysis chain is not applied to the query. As a result, Solr looks for single tokens starting with "project manage", but if you have an analysis chain at indexing time, your text is usually split into multiple tokens (e.g. "project" and "manager"), so no single indexed token can ever match.
You can use a ShingleFilter to index multiple tokens as a single token, which gets around the issue (be sure to use the same separator as in your text); a configuration sketch follows below.
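A minimal sketch of such a field type, assuming whitespace-separated text; the type name text_shingle and the parameter values are illustrative, not canonical:

    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- split on whitespace, lowercase, then emit word pairs as extra tokens -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=" "/>
      </analyzer>
      <analyzer type="query">
        <!-- keep the query as a single lowercase token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

At index time, "project manager" then produces the tokens "project", "manager", and "project manager", so a prefix query for project\ manage* (space escaped) can match the shingled token.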
Another option is to lowercase the field when indexing and querying and use a regular StrField, which isn't processed in any way, or use a KeywordTokenizer, which keeps the indexed content as a single token; see the sketch below.
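For the KeywordTokenizer variant, a field type along these lines keeps each value as one lowercase token (the name text_exact is a placeholder):

    <fieldType name="text_exact" class="solr.TextField">
      <analyzer>
        <!-- the whole field value becomes a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Because the whole value is a single token, project\ manage* (with the space escaped) or {!prefix f=yourfield}project manage (yourfield being your field name) will match values beginning with "project manage".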

Related

Azure Search: wildcard queries do not work with Japanese/Chinese characters

I used the icu_tokenizer in a custom analyzer to create a search index for Japanese words, and the index was created successfully. I chose the icu_tokenizer because it works better for Asian languages than the default Azure Search tokenizer.
Now when I query the index for a string such as 赤城, I see multiple search results (131 in total). But when I use a wildcard search with the same word, e.g. 赤城* (adding * at the end of the word) or /赤城.*/ (a regex search query), I get 0 results. The weird part is that * seems to work with a single Japanese character: 赤* gives the same number of results as 赤. But as soon as the query contains more than one Japanese character, wildcard queries with * stop working and return 0 results. I am testing all of these queries in Search explorer on the Azure portal with queryType=full (Lucene query syntax).
In my application, search terms are normally used as prefix searches, so we append * to the end of the search string to fetch results, but these Lucene wildcard queries just do not work with Japanese characters. Any idea how I can make these prefix queries (with a trailing wildcard *) work when the search strings are in Japanese?
Any quick help will be much appreciated!
I tested this on my installation and can confirm that wildcards only work with Japanese content when you use a Japanese analyzer.
In my test I set up one index with a Body property that has no specific analyzer defined, and another index where Body uses the ja.microsoft language analyzer. The content in both indexes is identical. I then searched for 自動車 (automobile) with a trailing wildcard.
自動車* returns multiple hits from the index using the Japanese analyzer; no hits are returned from the index without a specific analyzer defined.
Sorry for the late reply.
Have you tried using one of the Japanese language analyzers, for example ja.microsoft?
Also, if you want prefix search, you can try experimenting with the suggester feature, which is designed to be efficient for this scenario.

How to remove stopwords only when they are not nouns?

I'm using Solr 5 and need to remove stopwords to prevent over-matching and avoid bloating the index with very common terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases I don't want "A" and "OR" to be removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results, that's good.
But just to give you an alternative: you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or you could copyField your content to another (stored=false) field and process it without WordDelimiterFilterFactory altogether, as sketched below. Then you search over both copies of your data, possibly with different boosts for the different fields.
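A rough sketch of that setup; all field and type names here are invented for illustration:

    <!-- main field, analyzed with your existing chain -->
    <field name="text" type="text_general" indexed="true" stored="true"/>
    <!-- second copy: indexed only, no WordDelimiterFilterFactory -->
    <field name="text_verbatim" type="text_no_wdf" indexed="true" stored="false"/>
    <copyField source="text" dest="text_verbatim"/>

    <fieldType name="text_no_wdf" class="solr.TextField">
      <analyzer>
        <!-- whitespace only: "123-OR-A" stays one token -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With edismax you could then search both copies, e.g. qf=text text_verbatim^2, so untouched part numbers still match and rank higher.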

Removing extra periods (dots) from tokens while indexing in Solr

I want to remove extra periods between tokens when Solr indexes documents.
I could always do this with custom code before indexing, but is there a tokenizer, analyzer, or configuration that will strip off unnecessary periods (dots)?
Example: This repair shop is very good... I would recommend it to anyone who wants to repair their bikes...Please give it a try.....
I have gone through multiple tokenizers and analyzers; none of them seem to work for this.
I am currently using solr.WhitespaceTokenizerFactory and solr.WordDelimiterFilterFactory along with a few other filters.
Because of the way I am using WordDelimiterFilterFactory, Solr generates
good, good..., bikes..., bikes, bikesplease, try, try.....
I don't want Solr to generate tokens with ... at the end.
Any ideas on how to do this without writing custom code?
Have you tried solr.StandardTokenizerFactory?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
I tried this tokenizer and it seems to work as you expect; a sketch follows below.
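A minimal field type using it, assuming you only need lowercasing on top (the name text_std is a placeholder):

    <fieldType name="text_std" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- StandardTokenizer drops punctuation such as runs of periods -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With this, "bikes...Please" is tokenized as "bikes" and "please", with no trailing dots.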

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof', or vice versa.
I was going to try to address this by creating a set of alternate match fields, manually concatenating the words from the source documents. This seems kind of hackish, and implementing the other direction (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?
Did you have a look at the Solr spellcheck (spell check) component?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such cases; a configuration sketch follows below.
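A solrconfig.xml sketch, loosely based on the documented example; the field name text and the parameter values are assumptions to adapt:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">text</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
      </lst>
      <lst name="spellchecker">
        <!-- suggests splitting "foolproof" into "fool proof" and combining "fool proof" into "foolproof" -->
        <str name="name">wordbreak</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="field">text</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">2</int>
      </lst>
    </searchComponent>

Querying with spellcheck=true&spellcheck.dictionary=default&spellcheck.dictionary=wordbreak then returns suggestions in both directions.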

Apache Solr: Correct use of CompoundWordFilter

I'm trying to figure out how best to configure Solr for my app. I'm indexing (mostly German) PDF documents, and I'm using dismax queries to query Solr.
If a document contains the word "Firmenprofil" (a German compound word meaning 'company profile'), it is only returned by queries for exactly that word. However, it would be desirable for queries containing only "Profil" to also return this document.
I downloaded a German dictionary file and applied a DictionaryCompoundWordTokenFilter to both the index and the query analyzer.
The problem is that the filter decomposes the query into very small parts (e.g. "pro" in the case of "Firmenprofil"), which results in all sorts of documents that contain words like "Product" being returned...
I tried removing the filter from the query analyzer, which leads to Solr not finding the document at all. I also tried leaving the query filter in but explicitly setting the onlyLongestMatch option to true; that didn't seem to have any effect at all.
OK, it seems my dictionary file was simply too big (~20 MB). I replaced it with a more compact one and now it works just fine...
Without your actual config files, it's a bit of a guessing game.
Did you check whether "profil" is part of the dictionary?
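For reference, one way to avoid the tiny-fragment problem is to decompound only at index time and raise minSubwordSize. A sketch, where the type name text_de and the dictionary file name german-words.txt are placeholders:

    <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- emit compound parts as extra tokens; minSubwordSize="4" keeps fragments like "pro" out -->
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                dictionary="german-words.txt"
                minWordSize="5"
                minSubwordSize="4"
                maxSubwordSize="15"
                onlyLongestMatch="true"/>
      </analyzer>
      <analyzer type="query">
        <!-- no decompounding at query time, so queries stay intact -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

"Firmenprofil" is then indexed as firmenprofil, firmen, and profil, so a plain query for "Profil" matches without the query ever being decomposed.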
