Finding the most common terms in my Solr collection - solr

I need to identify potential stopwords in my Solr collection. Is it possible to find those terms which have the highest document frequency in my collection (or at least in a given shard)?

Yes, use HighFreqTerms, like:
// HighFreqTerms and TermStats are in org.apache.lucene.misc (lucene-misc module)
TermStats[] stats = HighFreqTerms.getHighFreqTerms(reader, 10, "myContentField", new HighFreqTerms.DocFreqComparator());
for (TermStats stat : stats) {
    System.out.println(stat.termtext.utf8ToString() + ", docfreq:" + stat.docFreq);
    // Or whatever else you want to do with them...
}
Luke also prominently displays the most common terms.

Since you already have Solr set up, use the TermsComponent to get the term frequencies for any given field:
http://wiki.apache.org/solr/TermsComponent
If you have a default search field (the destination of your copied fields), querying that field should give you the frequencies across all fields.
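As a rough illustration of the TermsComponent request described above, here is a minimal Java sketch that only builds the request URL. The base URL, the collection name, the field name "text" and the presence of a /terms handler are assumptions about your setup; terms.fl, terms.limit and terms.sort are the component's own parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TermsRequest {
    // Builds a TermsComponent URL that asks for the `limit` most frequent terms of one field.
    // baseUrl and field are placeholders supplied by the caller, not values from the question.
    static String topTermsUrl(String baseUrl, String field, int limit) {
        String fl = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return baseUrl + "/terms?terms=true"
                + "&terms.fl=" + fl          // field to read terms from
                + "&terms.limit=" + limit    // how many terms to return
                + "&terms.sort=count";       // most frequent first
    }

    public static void main(String[] args) {
        // Hypothetical host and collection name.
        System.out.println(topTermsUrl("http://localhost:8983/solr/mycollection", "text", 50));
    }
}
```

The terms at the top of the response are the natural stopword candidates to review by hand.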

Related

Lucene comparing document contents

I am trying to compare the contents of documents using Solr. I do this by simply using the entire document contents as a query. This works until the documents get large. A document can contain as many as 15k words or more. This results in a max boolean clauses exception, since the limit defaults to 1024. Now I could of course increase this value, but even if I increase it to 5k it will remain impossible to compare documents with large contents.
Is Lucene even suitable for such tasks? If so, what should I do to meet these requirements? If not, what would be an alternative way of comparing the contents of one document with other documents?
I think MoreLikeThis is what you want. MoreLikeThis prunes a document's contents down to its higher-frequency terms and searches with just those, which gets around the high number of terms (and improves performance). If you are searching for documents similar to an external source:
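MoreLikeThis mlt = new MoreLikeThis(indexreader);
mlt.setAnalyzer(analyzer); // needed to tokenize text that is not in the index
Query query = mlt.like(someReader, "contents"); // Lucene 4.x signature; later versions take the field name first
TopDocs hits = indexsearcher.search(query, 10); // top 10 similar documents
Or if searching for a document already in the index:
MoreLikeThis mlt = new MoreLikeThis(indexreader);
Query query = mlt.like(documentNumber);
TopDocs hits = indexsearcher.search(query, 10);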
Solr also includes a MoreLikeThis handler.
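To make the pruning idea concrete, here is a small plain-Java sketch that reduces a long text to its most frequent terms. It is not Lucene's implementation: the real MoreLikeThis also weighs terms by how rare they are in the index, which this sketch leaves out, and the tokenization and the minWordLen cutoff here are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermPruner {
    // Returns up to maxTerms of the most frequent tokens in text, ignoring tokens
    // shorter than minWordLen. Splitting on non-word characters is a simplification.
    static List<String> topTerms(String text, int maxTerms, int minWordLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (tok.length() >= minWordLen) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (result.size() == maxTerms) break;
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topTerms("the cat and the hat and the bat", 2, 3));
    }
}
```

A 15k-word document reduced this way yields a query with a few dozen clauses, well under the boolean clause limit.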

Solr Custom Boosting if a specific field matches the query

We are trying to implement a very interesting search logic with custom boosting and I am wondering if Solr can support this.
We have the following fields in our index:
Name
Description
Keywords (array)
Each keyword has an amount (an int value) paired with it.
A search is run across the Name, Description and Keywords fields. If a keyword matches the search text, the corresponding result must be boosted based on the amount of the matching keyword only.
I've read through the Solr DisMax documentation, and it can only boost a field by a fixed amount.
My scenario is to boost the result by X amount based on the matching keywords only.
Thanks in advance
The only viable solution I see to this problem (assuming, of course, that you do NOT know the number of keywords in advance) would be to run the query as a filter query (to skip the scoring stage), get all matching documents (a bit problematic), then sort them on your side, using the matched term to build a Java Comparator.
Problems may arise when you get a particularly large number of documents, but you could probably sidestep this issue with pagination.
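A minimal sketch of that client-side sort, assuming each hit has already been fetched and turned into a small object holding its keyword-to-amount pairs. The Doc class and its field names are hypothetical, not part of any Solr client API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class KeywordAmountSort {
    // Hypothetical client-side view of a hit: its id plus keyword -> amount pairs.
    static final class Doc {
        final String id;
        final Map<String, Integer> keywordAmounts;

        Doc(String id, Map<String, Integer> keywordAmounts) {
            this.id = id;
            this.keywordAmounts = keywordAmounts;
        }
    }

    // Orders hits by the amount attached to the matched keyword, highest first.
    // Hits that do not carry the keyword count as 0 and keep their relative order.
    static List<Doc> sortByMatchedAmount(List<Doc> docs, String searchTerm) {
        List<Doc> sorted = new ArrayList<>(docs);
        Comparator<Doc> byAmount =
                Comparator.comparingInt((Doc d) -> d.keywordAmounts.getOrDefault(searchTerm, 0));
        sorted.sort(byAmount.reversed());
        return sorted;
    }
}
```

Because the sort happens after retrieval, it only orders the documents you actually fetched, which is why a large result set is the weak point of this approach.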
If you don't have too many different amounts, maybe you can try this at index time:
Store the keywords in different fields (dynamic fields -> boost-*) based on their amounts:
boost-1 = keyword1,keyword4,keyword6
boost-10 = keyword2
boost-100 = keyword5
You can then search across all your boost fields with (e)dismax and boost every dynamic field by its amount in your (e)dismax configuration (boost-1^1, boost-10^10, boost-100^100).
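The query-fields string for that setup can be generated from the list of distinct amounts. This is only a sketch: the "boost-" prefix follows the naming used above, and joining with spaces reflects how the qf parameter separates fields.

```java
import java.util.List;
import java.util.stream.Collectors;

public class BoostFields {
    // Builds a qf value such as "boost-1^1 boost-10^10" from the distinct amounts.
    // The "boost-<amount>" naming mirrors the dynamic fields described in the answer.
    static String queryFields(List<Integer> amounts) {
        return amounts.stream()
                .map(a -> "boost-" + a + "^" + a)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(queryFields(List.of(1, 10, 100)));
    }
}
```

This stays manageable only while the number of distinct amounts is small, since every amount becomes its own field.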

How to boost AND in a solr query?

Suppose a user enters a two-word search. Since the default boolean operator is OR, all entries containing either or both words appear.
What I would like to know is whether entries that meet the AND condition could be boosted.
In the case of multiple words, can certain words be interpreted as search constraints, or used to boost certain parameters when they are present? For example, if the input is "with x and y without z", can I make Solr interpret it as (x AND y) AND (NOT z), or at least boost those entries which partially or fully meet the requirement?
EDIT:
I have tried using boost with edismax as shown here:
$query = $client->createSelect(); // create the search query
$query->setQuery('memberType:'.$searchQuery.' firstName:'.$searchQuery.' gender:'.$searchQuery); // fields to search, each paired with the search text
$edismax = $query->getEDisMax();
$edismax->setQueryFields('firstName memberType^3 gender^2'); // boost fields
$query->setStart($start)->setRows($rows); // starting offset and number of rows to return
$query->setFields(array('id', 'firstName', 'lastName', 'eid', 'gender', 'memberType')); // fields to return
//$query->addSort('id', $query::SORT_ASC); // optional sort
$resultSet = $client->select($query);
When I search for a name with a particular member type, like "sanjay candidate", I expect the order to be entries matching both sanjay and candidate, then all users who are candidates, then all users named sanjay. Instead I get entries matching both, then all who are sanjay, and then all candidates.
I am not able to figure out what the issue may be, or whether I can provide more customized boosting.
If you are using eDisMax, you have a whole collection of boosting options: phrase boosts, bigram boosts, a separate boosting query and so on. Read through the wiki page and experiment. You should not need to do any custom coding for this scenario.
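One plain query-syntax way to favor documents that contain all the words is to OR the terms together and add a boosted AND group. This is a different technique from the eDisMax parameters mentioned above, shown here only as a sketch: the boost value is arbitrary, and the terms are assumed to need no escaping.

```java
import java.util.List;

public class AndBoost {
    // Produces e.g. "(x AND y)^5 OR (x OR y)": any document matching one term is
    // returned, and documents matching every term pick up the extra boosted clause.
    static String boostAllTerms(List<String> terms, int boost) {
        String all = String.join(" AND ", terms);
        String any = String.join(" OR ", terms);
        return "(" + all + ")^" + boost + " OR (" + any + ")";
    }

    public static void main(String[] args) {
        System.out.println(boostAllTerms(List.of("sanjay", "candidate"), 5));
    }
}
```

How strongly the AND group dominates depends on the boost and on the field boosts already in play, so the value needs tuning against real results.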

Is it possible to use Solr TermsComponent to return the "n" most frequent indexed terms over a base query?

We want to be able to return the "n" most frequent indexed terms for certain documents selected from a base query. Is that possible using Solr?
Yes, you can do this by turning faceting on and faceting on the field from which you're trying to get the frequently indexed terms. You might actually get more information than you need (Solr will return all terms ordered by frequency rather than the top n):
?q=keyword&facet=true&facet.field=myfield
If you use &rows=0 as well, Solr will return only the faceting information and not the actual search results.
EDIT: Actually, by default Solr returns the top 100 facet terms. Use the facet.limit parameter to change this number. So, to return the top n terms, do the following:
?q=keyword&facet=true&facet.field=myfield&facet.limit=n
Use a negative number for facet.limit to return all terms. More information here: http://wiki.apache.org/solr/SimpleFacetParameters

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave oddly with stemming, because wildcard expansion works by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
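The effect can be reproduced with a toy index in plain Java. The stemmer below is hypothetical (it just strips a trailing "es" or "e") and stands in for whatever stemmer your field type uses; the point is only that the prefix is compared against the stored stems without being stemmed itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class StemmedPrefixDemo {
    // Toy stemmer (hypothetical, not Solr's): strips a trailing "es" or "e".
    static String stem(String word) {
        if (word.endsWith("es")) return word.substring(0, word.length() - 2);
        if (word.endsWith("e")) return word.substring(0, word.length() - 1);
        return word;
    }

    // Index time: every word is lowercased and stemmed before it is stored.
    static TreeSet<String> index(List<String> words) {
        TreeSet<String> terms = new TreeSet<>();
        for (String w : words) terms.add(stem(w.toLowerCase()));
        return terms;
    }

    // Query time: walk the sorted terms starting at the prefix. The prefix is used as
    // typed, without stemming, which is what happens to a wildcard query.
    static List<String> prefixMatches(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix)) {
            if (!t.startsWith(prefix)) break;
            hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = index(List.of("Gatorade", "gatorades", "water"));
        System.out.println(prefixMatches(terms, "gatorade")); // no stored stem starts with this
        System.out.println(prefixMatches(terms, "gatorad"));  // the stored stem itself
    }
}
```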
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to turn the search term 'gatorade' into 'gatorade OR gatorade*', which will give me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend Solr's query parser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This will index n-grams, that is, parts of words. With a min n-gram size of 5 and a max n-gram size of 8, "Documents" would be indexed as: Docum, Docume, Documen, Document.
There is a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries. You aren't doing a wildcard search; you are matching a search term against n-grams (parts of words).
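To see which terms edge n-gramming puts into the index, here is a small plain-Java sketch of the idea. It imitates only the basic behavior of the filter (prefixes between a min and max length) and is not the Solr class itself.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {
    // Returns the prefixes of token whose length lies between minGram and maxGram.
    // A token shorter than minGram yields no grams at all.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [Docum, Docume, Documen, Document]
        System.out.println(edgeNGrams("Documents", 5, 8));
    }
}
```

A plain query for "Docume" then matches an indexed term directly, with no wildcard expansion involved.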
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read, the wildcards only matched words with additional characters after the search term: "Gatorade*" would match Gatorades but not Gatorade itself. It appears Solr 3.6 addressed this by introducing a 'multiterm' analyzer that is applied to wildcard terms.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources