Solr index searching across word boundaries

I am trying to configure a Solr index of business names so that I can do business name lookups. Here is the use case I'm trying to solve:
My Solr index contains "WHOLE FOODS MARKET". The string I'm trying to look up contains some relevant information and some irrelevant information: "WHOLEFDS TRB 10245". I would like that lookup to match the indexed name.
Any help or pointers would be appreciated, as I'm a Solr novice.

Take a look at the n-gram filter in the example schema.xml that ships in the Solr zip distribution.
Further links:
How to use n-grams approximate matching with Solr?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
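To see why n-grams help with the "WHOLEFDS" case, here is a minimal Python sketch of character trigram overlap. It only illustrates the idea and is not how Solr's n-gram filters are implemented; the function names, the gram size of 3, and the two extra business names are assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams taken from each whitespace-separated token."""
    grams = set()
    for token in text.lower().split():
        if len(token) < n:
            # Tokens shorter than n are kept whole.
            grams.add(token)
        else:
            for i in range(len(token) - n + 1):
                grams.add(token[i:i + n])
    return grams


def ngram_overlap(query, indexed, n=3):
    """Fraction of the query's n-grams that also occur in the indexed value."""
    query_grams = char_ngrams(query, n)
    indexed_grams = char_ngrams(indexed, n)
    if not query_grams:
        return 0.0
    return len(query_grams & indexed_grams) / len(query_grams)


# Hypothetical index contents; only the first name comes from the question.
names = ["WHOLE FOODS MARKET", "TRADER JOES", "HOME DEPOT"]
query = "WHOLEFDS TRB 10245"

# Pick the indexed name that shares the most trigrams with the query.
best = max(names, key=lambda name: ngram_overlap(query, name))
print(best)
```

The trigrams "who", "hol" and "ole" from "WHOLEFDS" also occur in "WHOLE", so the two strings share grams even though no whole word matches. The noise tokens "TRB" and "10245" simply contribute grams that match nothing.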

Related

Solr: Find "Significant Terms" on Subset of Documents

I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:
http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000
That, of course, only gives me documents that have "apple" in the name, but the document frequency is still counted over the entire dataset, which is not what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.
Thanks,
Adrian
It is one of the items I have in my backlog [1].
What you actually need is the document frequency in your foreground set (your subset of docs) and the document frequency in your background set (your corpus).
Solr won't do that out of the box, but you can build it yourself.
Elasticsearch has a module for that, which you can take inspiration from [2].
[1] https://issues.apache.org/jira/browse/SOLR-9851
[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
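As a rough illustration of the foreground/background idea, the sketch below computes document frequencies for both sets in plain Python and ranks terms by how much more often they appear in the subset than in the corpus. The scoring is a plain ratio chosen for simplicity, not the heuristic Elasticsearch uses and not anything Solr provides; the function names and the toy documents are assumptions.

```python
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def significant_terms(foreground, background):
    """Rank foreground terms by (foreground DF rate) / (background DF rate)."""
    fg_df = doc_freq(foreground)
    bg_df = doc_freq(background)
    scores = {}
    for term, fg_count in fg_df.items():
        bg_count = bg_df[term]
        if bg_count == 0:
            # Term never seen in the background set; skip to avoid dividing by zero.
            continue
        # Same ratio as (fg_count / len(foreground)) / (bg_count / len(background)).
        scores[term] = (fg_count * len(background)) / (bg_count * len(foreground))
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (background set); the foreground set is the docs matching "apple".
corpus = [
    "apple ipod video",
    "apple power adapter",
    "samsung hard drive",
    "maxtor hard drive",
    "canon power shot camera",
]
subset = [doc for doc in corpus if "apple" in doc.split()]

for term, score in significant_terms(subset, corpus):
    print(term, score)
```

Here "power" scores lower than the other terms because it also appears outside the subset, which is the effect a corpus-wide DF alone cannot show.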

Is there a way to get a list of all the words in the solr spellcheck index?

I'm using Solr's SpellCheckComponent with IndexBasedSpellChecker. Wondering if there's a way to get an output of all the words in the dictionary.
Might help us catch some of the misspellings on our site.
Yes, there is. According to the docs: "The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It requires defining a field as the basis for the index terms."
So it just uses one field that you choose from the index. To enumerate all terms in a field, use the Terms component and set terms.fl to that field. If you have lots of terms, you can page through them with terms.lower, terms.limit and terms.upper to get the data in multiple calls.
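A small sketch of how those paged requests could be built in Python. It only constructs URLs and does not call Solr. The `/terms` handler path, the `techproducts` core, and the field name `spell` are assumptions; substitute whatever your solrconfig.xml and spellchecker configuration actually use.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=1000, lower=None):
    """Build a Terms component request URL for one page of terms."""
    params = {
        "terms.fl": field,       # field to enumerate terms from
        "terms.limit": limit,    # page size
        "terms.sort": "index",   # alphabetical order, so paging by terms.lower works
        "wt": "json",
    }
    if lower is not None:
        # Start strictly after the last term of the previous page.
        params["terms.lower"] = lower
        params["terms.lower.incl"] = "false"
    return base_url + "?" + urlencode(params)


# Assumed endpoint and field name, for illustration only.
base = "http://localhost:8983/solr/techproducts/terms"

first_page = terms_url(base, "spell", limit=100)
# Suppose the last term returned on the first page was "apple":
next_page = terms_url(base, "spell", limit=100, lower="apple")

print(first_page)
print(next_page)
```

Repeating the call with the last term of each page as the next `terms.lower` walks the whole dictionary until a page comes back with fewer than `terms.limit` entries.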

how to implement solr index partitioning

I want Solr to create indexes based on a specific field. For example, I have a field in schema.xml, createDate (which might have the value 2012, 2013, etc.). While indexing, if the value of that field is 2013, the document should be indexed in the /data/2013/index folder (or some other logically separated folder). I tried adding the following to my solrconfig.xml just before the <config> tag ends:
<partition>
<partitionField name="creationYear">
<value>2004</value>
<value>2005</value>
<value>2006</value>
<value>2007</value>
<value>2008</value>
<value>2009</value>
<value>2010</value>
<value>2011</value>
<value>2012</value>
<value>2013</value>
</partitionField>
</partition>
While indexing, it's not working, and it seems that this was just an idea that was never actually implemented in Solr. Am I assuming correctly? Or is there a way to have Solr create index folders dynamically based on the year (as in this example)?
Any help would be appreciated!!

How do I override Solr's relevancy in a query

I am integrating a chemical structure search with Solr. To that end I am creating a Solr plugin.
The structure search returns the structure_id and its score. Scores are values between 100 and 0 (we would probably never see a 0).
I use this to create a Solr query to pull all documents that have the structure_ids. I want the results of the search to be ordered by the structure search score, not the Solr relevancy.
I generate a query that looks like this:
+structure_id:(28760263^95 OR 30392284^82 OR 47390042^70)
The problem is that in my trivial test case Solr is returning the records matching the structure_id 28760263 last. It has assigned it the lowest relevancy (4.6609402E-6)!
I wrote a function that amplifies the score by a large factor, and that does appear to fix the problem. However, I don't think the amplification should be necessary.
I am using Solr 3.5.
Is there some configuration that I am missing? Currently I am using Solr pretty much out of the box. The only things I've changed are adding my plugin and editing the example docs to add structure_ids for my test case.
Is there a way to completely override the lucene scoring with the score from the structure search? We have other reasons why we would like to take control of Solr's scoring and knowing how to do that would be useful
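For reference, the boosted query shown above can be generated from the structure search hits as in the sketch below. The second function shows one possible client-side workaround, re-sorting the returned documents by the external score. This is not a Solr feature or setting, and all function names and the sample documents are hypothetical.

```python
def build_structure_query(hits, field="structure_id"):
    """Build a required clause that ORs the ids, each boosted by its structure score."""
    clauses = " OR ".join(f"{structure_id}^{score}" for structure_id, score in hits)
    return f"+{field}:({clauses})"


def sort_by_structure_score(docs, hits, field="structure_id"):
    """Reorder Solr results by the external structure score, highest first."""
    score_by_id = dict(hits)
    # Documents without a known score sort last.
    return sorted(docs, key=lambda doc: score_by_id.get(doc[field], 0), reverse=True)


# (structure_id, score) pairs as returned by the structure search in the question.
hits = [(28760263, 95), (30392284, 82), (47390042, 70)]
query = build_structure_query(hits)
print(query)

# Hypothetical documents in the order Solr happened to return them.
docs = [
    {"structure_id": 47390042},
    {"structure_id": 28760263},
    {"structure_id": 30392284},
]
ordered = sort_by_structure_score(docs, hits)
print([doc["structure_id"] for doc in ordered])
```

Client-side sorting only works when the whole result set is fetched in one request, so it does not replace a server-side way to control scoring.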

Can we index and search with different languages in the same Solr index?

I have data coming from an external system (in CSV form).
The data contains fields like :
id - french_title - english_title - french_desc - english_desc etc...
I know I can use multiple cores but is there a way to index and search this with just one core?
For example, can I tell Solr to use a French Analyzer on french_title and french_desc and an English analyzer on english_title and english_desc?
It should work since Solr lets you configure the analyzer to use on a per-field basis.
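As a sketch of what the per-field setup could look like, the Python below builds a Schema API `add-field` payload that maps each CSV column to a language-specific field type. It assumes field types named `text_fr` and `text_en` already exist in your schema (the default configsets in recent Solr versions define them, but check yours) and that the Schema API is available. On older versions the same mapping would be written as `<field>` entries in schema.xml.

```python
import json

# Mapping from CSV column to an assumed language-specific field type.
FIELD_TYPES = {
    "french_title": "text_fr",
    "french_desc": "text_fr",
    "english_title": "text_en",
    "english_desc": "text_en",
}


def schema_payload(field_types):
    """Build an add-field command listing one definition per field."""
    return {
        "add-field": [
            {"name": name, "type": field_type, "indexed": True, "stored": True}
            for name, field_type in field_types.items()
        ]
    }


# JSON body that could be POSTed to the core's /schema endpoint.
payload = json.dumps(schema_payload(FIELD_TYPES), indent=2)
print(payload)
```

All four fields live in one core; each one is analyzed by the analyzer chain attached to its own field type at both index and query time.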
